Jolt Alpha

Jolt is a zkVM framework built around the Lasso lookup argument.

Jolt powers succinct proofs of execution of programs written in any high-level language. Jolt's sumcheck-based SNARK makes extensive use of multivariate polynomials and commitment schemes. Jolt zkVMs have state-of-the-art prover performance and have substantial room for growth over the coming decades.

Jolt zkVMs have a simple programming model, requiring only 50-100 LOC to implement new VM instructions.

The Jolt codebase currently targets the RISC-V instruction set which is supported by most high-level language compilers, but the code is intended to be extensible and usable by any ISA.

The only property of the ISA that Jolt requires is that each primitive instruction is "decomposable". This means that evaluating the instruction on one or two 32-bit or 64-bit inputs can be done via a procedure of the following form:

  • decompose each input into, say, 8-bit chunks
  • apply one or more specified functions to these chunks (one from each input)
  • reconstruct the output of the original instruction from the outputs of the functions operating on the chunks (a concrete sketch follows)
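
To make the decomposability idea concrete, here is a minimal Rust sketch (purely illustrative, not Jolt's actual API or chunk layout) of evaluating 32-bit bitwise OR chunk-wise:

fn or_via_chunks(x: u32, y: u32) -> u32 {
    // 1. Decompose each input into 8-bit chunks (most significant first).
    let x_chunks = [(x >> 24) as u8, (x >> 16) as u8, (x >> 8) as u8, x as u8];
    let y_chunks = [(y >> 24) as u8, (y >> 16) as u8, (y >> 8) as u8, y as u8];

    // 2. Apply the chunk-level function (8-bit OR) to each pair of chunks.
    // 3. Reconstruct the 32-bit output by concatenating the chunk outputs.
    x_chunks
        .iter()
        .zip(y_chunks.iter())
        .map(|(a, b)| a | b)
        .fold(0u32, |acc, c| (acc << 8) | c as u32)
}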

Background reading

Credits

Lasso was written by Srinath Setty, Justin Thaler and Riad Wahby. Jolt was written by Arasu Arun, Srinath Setty, and Justin Thaler.

Jolt was initially forked from Srinath Setty's work on microsoft/Spartan, specifically the arkworks-rs/Spartan fork in order to use the excellent Arkworks-rs prime field arithmetic library. Jolt's R1CS is also proven using a version of Spartan (forked from the microsoft/Spartan2 codebase) optimized to the case of uniform R1CS constraints.

Using Jolt

Using Jolt is simple thanks to our tooling and macros. To get started, head over to our quickstart guide. For more in-depth information, check out the guests and hosts section.

Quickstart

Installing

Start by installing the jolt command line tool.

cargo +nightly-2024-09-30 install --git https://github.com/a16z/jolt --force --bins jolt

Creating a Project

To create a project, use the jolt new command.

jolt new <PROJECT_NAME>

This will create a template Jolt project in the specified location. Make sure to cd into the project before running the next step.

cd <PROJECT_NAME>

Project Tour

The main folder contains the host, which is the code that can generate and verify proofs. Within this main folder, there is another package called guest which contains the Rust functions that we can prove from the host. For more information about these concepts refer to the guests and hosts section.

We'll start by taking a look at our guest. We can view the guest code in guest/src/lib.rs.


#![cfg_attr(feature = "guest", no_std)]
#![no_main]

#[jolt::provable]
fn fib(n: u32) -> u128 {
    let mut a: u128 = 0;
    let mut b: u128 = 1;
    let mut sum: u128;
    for _ in 1..n {
        sum = a + b;
        a = b;
        b = sum;
    }
    b
}

As we can see, this implements a simple Fibonacci function called fib. All we need to do to make our function provable is add the jolt::provable macro above it.

Next let's take a look at the host code in src/main.rs.

pub fn main() {
    let (prove_fib, verify_fib) = guest::build_fib();

    let (output, proof) = prove_fib(50);
    let is_valid = verify_fib(proof);

    println!("output: {}", output);
    println!("valid: {}", is_valid);
}

The host imports guest::build_fib, which is automatically generated by the jolt::provable macro and returns a pair of functions for proving and verifying our fib function. The prove function takes the same inputs as the original fib function, but additionally returns a proof alongside the output. The verify function can then be used to check this proof, returning a boolean indicating its validity.

Running

Let's now run the host with cargo.

cargo run --release

This will compile the guest, perform some required preprocessing, and execute the host code which proves and verifies the 50th Fibonacci number. This preprocessing is run within the build_fib function and adds significant time to running the host, but only needs to be performed once. This means that we could use the prove method many times without rerunning build_fib. In the future we will support caching this across runs of the host.
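
Since the expensive preprocessing happens inside build_fib, a host can amortize it by reusing the returned prover across several inputs. Here is a minimal sketch based on the quickstart host above, assuming (as the text suggests) that the generated closures can be called repeatedly; the loop inputs are just examples:

pub fn main() {
    // Preprocessing runs once, inside build_fib.
    let (prove_fib, verify_fib) = guest::build_fib();

    // Reuse the returned prover and verifier without re-running preprocessing.
    for n in [10, 25, 50] {
        let (output, proof) = prove_fib(n);
        let is_valid = verify_fib(proof);
        println!("fib({}) = {}, valid: {}", n, output, is_valid);
    }
}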

Guests and Hosts

Guests and hosts are the main concepts that we need to understand in order to use Jolt. Guests contain the functions that Jolt will prove, while hosts are where we write more general code and invoke the prover and verifier for our guest functions.

Guests

Guests contain functions for Jolt to prove. Currently, these functions must be written using no_std Rust. If you have a function that does not require the standard library, making it provable is as easy as ensuring it is inside the guest package and adding the jolt::provable macro above it.

Let's take a look at a simple guest program to better understand it.


#![cfg_attr(feature = "guest", no_std)]
#![no_main]

#[jolt::provable]
fn add(x: u32, y: u32) -> u32 {
    x + y
}

As we can see, the guest looks like a normal no_std Rust library. The only major change is the addition of the jolt::provable macro, which lets Jolt know of the function's existence. Other than no_std, the only requirement of these functions is that their inputs are serializable and their outputs are deserializable with serde. Fortunately serde is prevalent throughout the Rust ecosystem, so most types support it by default.
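
For instance, a custom input type only needs to support serde. The following is a hypothetical sketch (it assumes the serde derive macros are available to the guest crate; the type and function names are illustrative):

use serde::{Deserialize, Serialize};

// Any type used as a guest input/output just needs serde support.
#[derive(Serialize, Deserialize)]
pub struct Point {
    pub x: u32,
    pub y: u32,
}

#[jolt::provable]
fn manhattan_distance(a: Point, b: Point) -> u32 {
    a.x.abs_diff(b.x) + a.y.abs_diff(b.y)
}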

There is no requirement that just a single function lives within the guest, and we are free to add as many as we need. Additionally, we can import any no_std compatible library just as we normally would in Rust.


#![cfg_attr(feature = "guest", no_std)]
#![no_main]

use sha2::{Digest, Sha256};
use sha3::Keccak256;

#[jolt::provable]
fn sha2(input: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(input);
    let result = hasher.finalize();
    Into::<[u8; 32]>::into(result)
}

#[jolt::provable]
fn sha3(input: &[u8]) -> [u8; 32] {
    let mut hasher = Keccak256::new();
    hasher.update(input);
    let result = hasher.finalize();
    Into::<[u8; 32]>::into(result)
}

Hosts

Hosts are where we can invoke the Jolt prover to prove functions defined within the guest. Hosts do not have the no_std requirement, and are free to use the Rust standard library.

The host imports the guest package, and will have automatically generated functions to build each of the Jolt functions. For the sha2 and sha3 example guest we looked at in the guest section, these functions would be called build_sha2 and build_sha3 respectively. Each returns two results, a prover function and a verifier function. The prover function takes in the same input types as the original function and modifies the output to additionally include a proof. The verifier can then take this proof and verify it.

pub fn main() {
    let (prove_sha2, verify_sha2) = guest::build_sha2();
    let (prove_sha3, verify_sha3) = guest::build_sha3();

    let input = &[5u8; 32];

    let (output, proof) = prove_sha2(input);
    let is_valid = verify_sha2(proof);

    println!("sha2 output: {:?}", output);
    println!("sha2 valid: {}", is_valid);

    let (output, proof) = prove_sha3(input);
    let is_valid = verify_sha3(proof);

    println!("sha3 output: {:?}", output);
    println!("sha3 valid: {}", is_valid);
}

Allocators

Jolt provides an allocator which supports most containers, such as Vec and Box. This is useful for Jolt users who would like to write no_std code rather than using Jolt's standard library support. To use these containers, they must be explicitly imported from alloc. The alloc crate is automatically provided by Rust and does not need to be added to the Cargo.toml file.

Example


#![cfg_attr(feature = "guest", no_std)]
#![no_main]

extern crate alloc;
use alloc::vec::Vec;

#[jolt::provable]
fn alloc(n: u32) -> u32 {
    let mut v = Vec::<u32>::new();
    for i in 0..100 {
        v.push(i);
    }

    v[n as usize]
}

Standard Library

Jolt supports the full Rust standard library. To enable support, simply add the guest-std feature to the Jolt import in the guest's Cargo.toml file and remove the #![cfg_attr(feature = "guest", no_std)] directive from the guest code.

Example


[package]
name = "guest"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "guest"
path = "./src/lib.rs"

[features]
guest = []

[dependencies]
jolt = { package = "jolt-sdk", git = "https://github.com/a16z/jolt", features = ["guest-std"] }

#![no_main]

#[jolt::provable]
fn int_to_string(n: i32) -> String {
    n.to_string()
}

WASM Support

Jolt supports WebAssembly compatibility for the verification side of provable functions. This allows developers to run proof verification in WASM environments such as web browsers. Note that currently, only the verification process is supported in WASM; the proving process is not WASM-compatible at this time.

Creating a WASM-Compatible Project

To create a new project with WASM verification support, use the following command:

jolt new <PROJECT_NAME> --wasm

This creates a new project with the necessary structure and configuration for WASM-compatible verification.

Marking Functions for WASM Verification

For functions whose verification process should be WASM-compatible, extend the #[jolt::provable] macro with the wasm attribute:


#[jolt::provable(wasm)]
fn my_wasm_verifiable_function() {
    // function implementation
}

Building for WASM Verification

To compile your project for WASM verification, use the following command:

jolt build-wasm

This command performs several actions:

  1. It extracts all functions marked with #[jolt::provable(wasm)] from your guest/src/lib.rs file.
  2. For each WASM-verifiable function, it preprocesses the function and saves the necessary verification data.
  3. It creates an index.html file as an example of how to use your WASM-compiled verification functions in a web environment.
  4. It uses wasm-pack to build your project, targeting web environments.

Important: The build process only compiles the verification functions for WASM. The proving process must still be performed outside of the WASM environment.

Adding Dependencies

When adding new dependencies for WASM-compatible projects, note that they must be added to both guest/Cargo.toml and the root Cargo.toml. The build-wasm process will automatically add necessary WASM-related dependencies to your project.

Switching Between WASM and Normal Compilation

The build process modifies the Cargo.toml files to enable WASM compilation. If you need to switch back to normal compilation, you may need to manually adjust these files. In the root Cargo.toml, you can remove the following lines:

[lib]
crate-type = ["cdylib"]
path = "guest/src/lib.rs"

Remember to restore these lines before attempting another WASM build.

Example Project Structure

After running jolt build-wasm, your project will include:

  • An index.html file in the root directory, providing a basic interface to verify proofs for each of your WASM-verifiable functions.
  • A pkg directory containing the WASM-compiled version of your project's verification functions.
  • Preprocessed data files for each WASM-verifiable function in the target/wasm32-unknown-unknown/release/ directory.

You can use this example as a starting point and customize it to fit your specific requirements.

Example: Modifying the Quickstart Project

You can use the example from the Quickstart chapter and modify the /src/main.rs file as follows:

use jolt::Serializable;
pub fn main() {
    let (prove_fib, _verify_fib) = guest::build_fib();

    let (_output, proof) = prove_fib(50);

    proof
        .save_to_file("proof.bin")
        .expect("Failed to save proof to file");
}

Remember to modify the Cargo.toml as described before for normal compilation, and then you can generate and save the proof with the following command:

cargo run -r

This will create a proof.bin file in the root directory. You can then verify this proof using the example index.html. Before doing this, change the crate type back to cdylib in the Cargo.toml and ensure that your /guest/src/lib.rs looks like this:


#![cfg_attr(feature = "guest", no_std)]
#![no_main]

#[jolt::provable(wasm)]
fn fib(n: u32) -> u128 {
    let mut a: u128 = 0;
    let mut b: u128 = 1;
    let mut sum: u128;
    for _ in 1..n {
        sum = a + b;
        a = b;
        b = sum;
    }

    b
}

Then run jolt build-wasm. After successful compilation, you can verify the proof using the index.html file. Start a local server from the project directory, for example with:

npx http-server .

Note: Make sure you have npx installed to use the http-server command.

Troubleshooting

Insufficient Memory or Stack Size

Jolt provides reasonable defaults for the total allocated memory and stack size. It is however possible that the defaults are not sufficient, leading to unpredictable errors within our tracer. To fix this we can try to increase these sizes. We suggest starting with the stack size first as this is much more likely to run out.

Below is an example of manually specifying both the total memory and stack size.


#![cfg_attr(feature = "guest", no_std)]
#![no_main]

extern crate alloc;
use alloc::vec::Vec;

#[jolt::provable(stack_size = 10000, memory_size = 10000000)]
fn waste_memory(size: u32, n: u32) {
    let mut v = Vec::new();
    for i in 0..size {
        v.push(i);
    }
}

Maximum Input or Output Size Exceeded

Jolt restricts the size of the inputs and outputs to 4096 bytes by default. Using inputs and outputs that exceed this size will lead to errors. These values can be configured via the macro.


#![cfg_attr(feature = "guest", no_std)]
#![no_main]

#[jolt::provable(max_input_size = 10000, max_output_size = 10000)]
fn sum(input: &[u8]) -> u32 {
    let mut sum = 0u32;
    for value in input {
        sum += *value as u32;
    }

    sum
}

Guest Attempts to Compile Standard Library

Sometimes after installing the toolchain the guest still tries to compile with the standard library, which fails with a large number of errors saying that certain items such as Result are referenced but not available. This generally happens when one tries to run jolt before installing the toolchain. To address this, try rerunning jolt install-toolchain, restarting your terminal, and deleting both your Rust target directory and any files under /tmp that begin with jolt.

Getting Help

If none of the above help, please serialize your program and send it along with a detailed bug report.

Serializing a call to the "fib" function in the Jolt guest:


// let (prove_fib, verify_fib) = guest::build_fib();
let program_summary = guest::analyze_fib(10);
program_summary.write_to_file("fib_10.txt".into()).expect("should write");

Contributors

How it works

Architecture overview

This section gives an overview of the core components of Jolt.

Jolt's four components

A VM does two things:

  • Repeatedly execute the fetch-decode-execute logic of its instruction set architecture.
  • Perform reads and writes to Random Access Memory (RAM).

The Jolt paper depicts these two tasks mapped to three components in the final Jolt proof:

Jolt Alpha

The Jolt codebase is similarly organized, but instead separates read-write memory (comprising registers and RAM) from program code (aka bytecode, which is read-only), for a total of four components:

fetch-decode-execute

Read-write memory

To handle reads/writes to RAM (and registers) Jolt uses a memory checking argument from Spice, which is closely related to Lasso itself. They are both based on "offline memory checking" techniques, the main difference being that Lasso supports read-only memory while Spice supports read-write memory, making it slightly more expensive. This is implemented in read_write_memory.rs.

For more details: Read-write memory

R1CS constraints

To handle the "fetch" part of the fetch-decode-execute loop, there is a minimal R1CS instance (about 60 constraints per cycle of the RISC-V VM). These constraints handle program counter (PC) updates and serves as the "glue" enforcing consistency between polynomials used in the components below. Jolt uses Spartan, optimized for the highly-structured nature of the constraint system (e.g., the R1CS constraint matrices are block-diagonal with blocks of size only about 60 x 80). This is implemented in jolt-core/src/r1cs.

For more details: R1CS constraints

Instruction lookups

To handle the "execute" part of the fetch-decode-execute loop, Jolt invokes the Lasso lookup argument. The lookup argument maps every instruction (including its operands) to its output. This is implemented in instruction_lookups.rs.

For more details: Instruction lookups

Bytecode

To handle the "decode" part of the fetch-decode-execute loop, Jolt uses another instance of (read-only) offline memory-checking, similar to Lasso. The bytecode of the guest program is "decoded" in preprocessing, and the prover subsequently invokes offline memory-checking on the sequence of reads from this decoded bytecode corresponding to the execution trace being proven. This is implemented in bytecode.rs.

For more details: Bytecode

Instruction lookups

In Jolt, the "execute" part of the "fetch-decode-execute" loop is handled using a lookup argument (Lasso). At a high level, the lookup queries are the instruction operands, and the lookup outputs are the instruction outputs.

Lasso

Lasso is a lookup argument (equivalent to a SNARK for reads into a read-only memory). A lookup argument allows the prover to convince the verifier that for a committed vector of values $v$, a committed vector of indices $a$, and a lookup table $T$: $v_i = T[a_i]$ for all $i$.

Lasso is a special lookup argument with highly desirable asymptotic costs largely correlated to the number of lookups (the length of the vectors $a$ and $v$), rather than the length of the table $T$.

A conversational background on lookups can be found here. In short: lookups are great for zkVMs as they allow constant cost / developer complexity for the prover algorithm per VM instruction.

A detailed engineering overview of Lasso (as a standalone lookup argument) can be found here.

Using Lasso to handle instruction execution

Lasso requires that each primitive instruction satisfies a decomposability property. The property needed is that the input(s) to the instruction can be broken into "chunks" (say, with each chunk consisting of 16 bits), such that one can obtain the answer to the original instruction by evaluating a simple function on each chunk and then "collating" the results together. For example, the bitwise-OR of two 32-bit inputs x and y can be computed by breaking each input up into 8-bit chunks, ORing each 8-bit chunk of x with the associated chunk of y, and concatenating the results together.

In Lasso, we call the task of evaluating a simple function on each chunk a "subtable lookup" (the relevant lookup table being the table storing all evaluations of the simple function). And the "collating" of the results of the subtable lookups into the result of the original lookup (instruction execution on the un-chunked inputs) is handled via an invocation of the sum-check protocol. We call this the "primary sumcheck" instance in Lasso.

The "primary sumcheck" collation looks as follows for a trace of length and a VM with unique instructions.

is an indicator function for the -th instruction: if instruction is used during the -th step of the trace, when .

is the collation function used by the -th instruction.

where is the number of independent memories used by an instruction. For simple instructions like the EQ instruction, , where is the number of chunks that the inputs are broken into, so . More complicated instructions such LT might have . The exact layout is dependent on the number of subtables required by the decomposition of the instruction. The mappings can be found in the JoltInstruction::subtable method implementations.

Mental Model

For a given $r$ (think of it as the integer index of a single step within the trace), $\widetilde{eq}(r, x) = 0$ for all but one term of the outer sum. Similarly, $\widetilde{flags_f}(x) = 0$ for all but one term of the inner sum. This leaves just the collation function of a single instruction, evaluating to the collated lookup output of that instruction. In reality $r$ is a random point selected by the verifier over the course of the protocol; the evaluation at this point provides a distance-amplified encoding of the entire trace of instructions.

To illustrate more concretely, imagine a two-instruction VM for the LT and EQ instructions.

Subtable Flags

  • We then use memory checking to determine that each of the memories is well formed
  • At a given step of the CPU only a single instruction will be used, which means that only that instruction's subtables will be used. For the rest of the memories we insert a no_op with (a, v) = 0. In order to make the GKR trees cheaper to compute and sumcheck, we add a single additional layer to the GKR tree. During this layer we "toggle" each of the GKR leaves to "1" in the case that it is an unused step of the CPU. This makes the binary tree of multiplication gates cheaper. We toggle based on a new subtable-flags polynomial, which is the sum of all of the instruction flags $\widetilde{flags_f}$ used in the instruction collation above.
  • The function to compute each of the leaves becomes $\text{subtable\_flags}(x) \cdot \text{fingerprint}(x) + (1 - \text{subtable\_flags}(x))$, so an unused leaf evaluates to 1 (as sketched below).
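
A minimal sketch of this toggling (plain integers standing in for field elements; the names are illustrative, not Jolt's):

// flag = 1 if this subtable is actually used at this step, 0 otherwise.
// When flag = 0 the leaf becomes 1, the multiplicative identity, so the
// unused lookup contributes nothing to the binary product tree.
fn toggled_leaf(flag: u64, fingerprint: u64) -> u64 {
    flag * fingerprint + (1 - flag)
}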

See the documentation on how Jolt leverages sparse-constraint-systems for further details on subtable flags and how they are used in Jolt.

Read-write memory (registers and RAM)

Jolt proves the validity of registers and RAM using offline memory checking. In contrast to our usage of offline memory checking in other modules, registers and RAM are writable memory.

Memory layout

For the purpose of offline memory checking, Jolt treats registers, program inputs/outputs, and RAM as occupying one unified address space. This remapped address space is laid out as follows:

Memory layout

The zero-padding depicted above is sized so that RAM starts at a power-of-two offset (this is explained below). As noted in the diagram, the size of the witness scales with the highest memory addressed over the course of the program's execution. In addition to the zero-padding between the "Program I/O" and "RAM" sections, the end of the witness is zero-padded to a power of two.

Handling program I/O

Inputs

Program inputs and outputs (plus the panic and termination bits, which indicate whether the program has panicked or otherwise terminated, respectively) live in the same memory address space as RAM. Program inputs populate the designated input space upon initialization:

init memory

The verifier can efficiently compute the MLE of this initial memory state on its own (i.e. in time proportional to the IO size, not the total memory size).

Outputs and panic

On the other hand, the verifier cannot compute the MLE of the final memory state on its own –– though the program I/O is known to the verifier, the final memory state contains values written to registers/RAM over the course of program execution, which are not known to the verifier.

The verifier is, however, able to compute the MLE of the program I/O values (padded on both sides with zeros) –– this is denoted v_io below. If the prover is honest, then the final memory state (v_final below) should agree with v_io at the indices corresponding to program I/O.

final memory

To enforce this, we invoke the sumcheck protocol to perform a "zero-check" on the difference between v_final and v_io:

final memory

This also motivates the zero-padding between the "Program I/O" and "RAM" sections of the witness. The zero-padding ensures that both input_start and ram_witness_offset are powers of two, which makes it easier for the verifier to compute the MLEs of v_init and v_io.

Timestamp range check

Registers and RAM are writable memory, which introduces additional requirements compared to offline memory checking in a read-only context.

The multiset equality check for read-only memory, typically expressed as $I \cup W = R \cup F$, is not adequate for ensuring the accuracy of read values. It is essential to also verify that each read operation retrieves a value that was written in a previous step (not a future step). (Here, $I$ denotes the tuples capturing initialization of memory and $W$ the tuples capturing all of the writes to memory following initialization. $R$ denotes the tuples capturing all read operations, and $F$ denotes the tuples capturing a final read pass over all memory cells.)

To formalize this, we assert that the timestamp of each read operation, denoted $\text{read\_timestamp}$, must not exceed the global timestamp at that particular step. The global timestamp starts at 0 and is incremented once per step.

The verification of $\text{read\_timestamp} \leq \text{global\_timestamp}$ is equivalent to confirming that $\text{read\_timestamp}$ falls within the range $[0, \text{TRACE\_LENGTH})$ and that the difference $(\text{global\_timestamp} - \text{read\_timestamp})$ is also within the same range.

The process of ensuring that both $\text{read\_timestamp}$ and $(\text{global\_timestamp} - \text{read\_timestamp})$ lie within the specified range is known as range-checking. This is the procedure implemented in timestamp_range_check.rs, using a modified version of Lasso.

Intuitively, checking that each read timestamp does not exceed the global timestamp prevents an attacker from answering all read operations to a given cell with "the right set of values, but out of order". Such an attack requires the attacker to "jump forward and backward in time". That is, for this attack to succeed, at some timestamp $t$ when the cell is read, the attacker would have to return a value that will only be written to that cell in the future (and at some later timestamp $t'$ when the same cell is read, the attacker would have to return a value that was written to that cell much earlier). This attack is prevented by confirming that all values returned have a timestamp that does not exceed the current global timestamp.
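
A minimal sketch of the equivalence described above, using plain u32 arithmetic in place of the actual Lasso-based range check (TRACE_LENGTH is illustrative):

const TRACE_LENGTH: u32 = 1 << 20; // illustrative bound; global timestamps stay below this

// read_ts <= global_ts holds iff both read_ts and (global_ts - read_ts)
// lie in [0, TRACE_LENGTH), given that global_ts < TRACE_LENGTH.
fn timestamp_check(read_ts: u32, global_ts: u32) -> bool {
    read_ts < TRACE_LENGTH && global_ts.wrapping_sub(read_ts) < TRACE_LENGTH
}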

Word-addressable memory

According to the RISC-V specification, the RISC-V memory is byte-addressable, i.e. load and store instructions can access any memory address, with no restrictions on whether the address is aligned to a 4-byte word. However, the specification caveats that unaligned accesses may incur performance penalties on hardware, and some RISC-V implementations may choose to disallow unaligned memory accesses entirely. In such implementations, LW and SW must access memory addresses that are word-aligned (i.e. multiples of 4), and LH, LHU, and SH must access memory addresses that are halfword-aligned (i.e. multiples of 2). There are no such restrictions on LB, LBU and SB, as they only access a single byte.

Jolt disallows unaligned memory accesses for prover performance reasons; one cannot generate a valid Jolt proof for an execution trace that includes unaligned memory accesses. Enforcing aligned accesses allows Jolt to treat memory as word-addressable, which reduces the number of committed polynomials and instances of offline memory-checking.

In more detail, Jolt

  1. constrains all LW and SW instructions to only allow word-aligned accesses, and constrains all LH, LHU, and SH instructions to only allow halfword-aligned accesses. This is accomplished using virtual ASSERT instructions (see Section 6.1.1 of the Jolt paper).
  2. replaces all LH, LHU, SH, LB, LBU, and SB instructions with virtual sequences that only perform word-aligned accesses.

LB virtual sequence

  1. ADDI rs1, --, imm, // Compute the memory address being accessed
  2. ANDI , --, (1 << 32) - 4, // Mask out the lower bits to obtain the word-aligned address
  3. LW , --, 0, // Load the full word
  4. XORI , --, 0b11, // Compute the number of bytes to shift the word (in the lower 2 bits)
  5. SLLI , --, 3, // Compute the number of bits to shift the word (in the lower 5 bits)
  6. SLL , , --, rd // Shift the word so that the desired byte is left-aligned
  7. SRAI rd, --, 24, rd // Right arithmetic shift to sign-extend and right-align the byte
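
A plain-Rust sketch of what the LB sequence above computes (little-endian memory; load_word stands in for a word-aligned memory read and is not a Jolt API):

fn lb_semantics(load_word: impl Fn(u32) -> u32, rs1: u32, imm: u32) -> i32 {
    let addr = rs1.wrapping_add(imm);             // 1. compute the memory address
    let aligned = addr & !0b11;                   // 2. mask to the word-aligned address
    let word = load_word(aligned);                // 3. load the full word
    let shift = ((addr & 0b11) ^ 0b11) * 8;       // 4-5. bits to shift so the byte is left-aligned
    let left_aligned = word << shift;             // 6. shift the desired byte to the top
    (left_aligned as i32) >> 24                   // 7. arithmetic shift: sign-extend, right-align
}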

LBU virtual sequence

  1. ADDI rs1, --, imm, // Compute the memory address being accessed
  2. ANDI , --, (1 << 32) - 4, // Mask out the lower bits to obtain the word-aligned address
  3. LW , --, 0, // Load the full word
  4. XORI , --, 0b11, // Compute the number of bytes to shift the word (in the lower 2 bits)
  5. SLLI , --, 3, // Compute the number of bits to shift the word (in the lower 5 bits)
  6. SLL , , --, rd // Shift the word so that the desired byte is left-aligned
  7. SRLI rd, --, 24, rd // Right logical shift to zero-extend and right-align the byte

LH virtual sequence

  1. ASSERT_HALFWORD_ALIGNMENT rs1, --, imm, -- // Virtual instruction to enforce aligned memory access
  2. ADDI rs1, --, imm, // Compute the memory address being accessed
  3. ANDI , --, (1 << 32) - 4, // Mask out the lower bits to obtain the word-aligned address
  4. LW , --, 0, // Load the full word
  5. XORI , --, 0b10, // Compute the number of bytes to shift the word (in the lower 2 bits)
  6. SLLI , --, 3, // Compute the number of bits to shift the word (in the lower 5 bits)
  7. SLL , , --, rd // Shift the word so that the desired halfword is left-aligned
  8. SRAI rd, --, 16, rd // Right arithmetic shift to sign-extend and right-align the halfword

LHU virtual sequence

  1. ASSERT_HALFWORD_ALIGNMENT rs1, --, imm, -- // Virtual instruction to enforce aligned memory access
  2. ADDI rs1, --, imm, // Compute the memory address being accessed
  3. ANDI , --, (1 << 32) - 4, // Mask out the lower bits to obtain the word-aligned address
  4. LW , --, 0, // Load the full word
  5. XORI , --, 0b10, // Compute the number of bytes to shift the word (in the lower 2 bits)
  6. SLLI , --, 3, // Compute the number of bits to shift the word (in the lower 5 bits)
  7. SLL , , --, rd // Shift the word so that the desired halfword is left-aligned
  8. SRLI rd, --, 16, rd // Right logical shift to zero-extend and right-align the halfword

SB virtual sequence

  1. ADDI rs1, --, imm, // Compute the memory address being accessed
  2. ANDI , --, (1 << 32) - 4, // Mask out the lower bits to obtain the word-aligned address
  3. LW , --, 0, // Load the full word
  4. SLLI , --, 3, // Compute the number of bits to shift the byte (in the lower 5 bits)
  5. LUI --, --, 0xff, // Load the bitmask into a virtual register
  6. SLL , , --, // Shift the bitmask into the position of the desired byte
  7. SLL rs2, , --, // Shift the value being stored to align with the bitmask
  8. XOR , , --,
  9. AND , , --,
  10. XOR , , --,
  11. SW , , 0, -- // Store the updated word

Instructions 8-10 use a bit-twiddling hack to mask the stored byte into the word.
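
A plain-Rust sketch of that bit-twiddling merge: only the masked byte of the loaded word is replaced by the (already shifted) store value.

// Equivalent to (value_shifted & mask) | (word & !mask), computed with the
// XOR/AND/XOR pattern of steps 8-10 in the SB sequence above.
fn masked_store(word: u32, value_shifted: u32, mask: u32) -> u32 {
    ((value_shifted ^ word) & mask) ^ word
}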

SH virtual sequence

  1. ASSERT_HALFWORD_ALIGNMENT rs1, --, imm, -- // Virtual instruction to enforce aligned memory access
  2. ADDI rs1, --, imm, // Compute the memory address being accessed
  3. ANDI , --, (1 << 32) - 4, // Mask out the lower bits to obtain the word-aligned address
  4. LW , --, 0, // Load the full word
  5. SLLI , --, 3, // Compute the number of bits to shift the halfword (in the lower 5 bits)
  6. LUI --, --, 0xffff, // Load the bitmask into a virtual register
  7. SLL , , --, // Shift the bitmask into the position of the desired halfword
  8. SLL rs2, , --, // Shift the value being stored to align with the bitmask
  9. XOR , , --,
  10. AND , , --,
  11. XOR , , --,
  12. SW , , 0, -- // Store the updated word

Instructions 9-11 use a bit-twiddling hack to mask the stored halfword into the word.

Bytecode

Jolt proves that the program bytecode was read correctly at each step of the execution trace using offline memory checking.

Decoding an ELF file

The tracer iterates over the .text sections of the program ELF file and decodes the RISC-V instructions. Each instruction gets mapped to a BytecodeRow struct:


pub struct BytecodeRow {
    /// Memory address as read from the ELF.
    address: usize,
    /// Packed instruction/circuit flags, used for R1CS.
    bitflags: u64,
    /// Index of the destination register for this instruction (0 if register is unused).
    rd: u64,
    /// Index of the first source register for this instruction (0 if register is unused).
    rs1: u64,
    /// Index of the second source register for this instruction (0 if register is unused).
    rs2: u64,
    /// "Immediate" value for this instruction (0 if unused).
    imm: u64,
}

The registers (rd, rs1, rs2) and imm are described in the RISC-V spec. The bitflags field is described in greater detail below.

The preprocessed bytecode serves as the (read-only) "memory" over which we perform offline memory checking.

bytecode_trace

We currently assume the program bytecode is sufficiently short and known by the verifier, so the prover doesn't need to provide a commitment to it –– the verifier can compute the MLE of the bytecode on its own. This guarantees that the bytecode trace is consistent with the agreed-upon program.

Instead of $(a, v, t)$ tuples, each step in the bytecode trace is mapped to a tuple $(a, v_{\texttt{bitflags}}, v_{\texttt{rd}}, v_{\texttt{rs1}}, v_{\texttt{rs2}}, v_{\texttt{imm}}, t)$. Otherwise, the offline memory checking proceeds as described in Offline Memory Checking.

Bitflags

The bitflags of a given instruction is the concatenation of its circuit flags and instruction flags. This concatenation is enforced by R1CS constraints.

bitflags
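
As a rough illustration of this packing (the actual bit ordering is defined by the codebase and the R1CS constraints, not by this sketch):

// Concatenate circuit flags and instruction flags into a single u64 bitfield.
fn pack_bitflags(circuit_flags: &[bool], instruction_flags: &[bool]) -> u64 {
    let mut bits = 0u64;
    for &flag in circuit_flags.iter().chain(instruction_flags.iter()) {
        bits = (bits << 1) | (flag as u64);
    }
    bits
}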

R1CS constraints

Jolt uses R1CS constraints to enforce certain rules of the RISC-V fetch-decode-execute loop and to ensure consistency between the proofs for the different modules of Jolt (instruction lookups, read-write memory, and bytecode).

Uniformity

Jolt's R1CS is uniform, which means that the constraint matrices for an entire program are just repeated copies of the constraint matrices for a single CPU step. Each step is conceptually simple and involves around 60 constraints and 80 variables.

Input Variables and constraints

The inputs required for the constraint system for a single CPU step are:

Pertaining to bytecode

  • Bytecode read address: the index in the program code read at this step.
  • The preprocessed representation of the instruction: (elf_address, bitflags, rs1, rs2, rd, imm).

Pertaining to read-write memory

  • The (starting) RAM address read by the instruction: if the instruction is not a load/store, this is 0.
  • The bytes written to or read from memory.

Pertaining to instruction lookups

  • The chunks of the instruction's operands x and y.
  • The chunks of the lookup query. These are typically some combination of the operand chunks (e.g. the i-th chunk of the lookup query is often the concatenation of x_i and y_i).
  • The lookup output.

Circuit and instruction flags:

  • There are twelve circuit flags (opflags in the Jolt paper) used in Jolt's R1CS constraints. They are enumerated in CircuitFlags and computed in to_circuit_flags (see rv_trace.rs). Circuit flags depend only on the instruction as it appears in the bytecode, so they are computed as part of the preprocessed bytecode in Jolt.
    1. LeftOperandIsPC: 1 if the first lookup operand is the program counter; 0 otherwise (first lookup operand is RS1 value).
    2. RightOperandIsImm: 1 if the second lookup operand is imm; 0 otherwise (second lookup operand is RS2 value).
    3. Load: 1 if the instruction is a load (i.e. LB, LH, etc.)
    4. Store: 1 if the instruction is a store (i.e. SB, SH, etc.)
    5. Jump: 1 if the instruction is a jump (i.e. JAL, JALR)
    6. Branch: 1 if the instruction is a branch (i.e. BEQ, BNE, etc.)
    7. WriteLookupOutputToRD: 1 if the lookup output is to be stored in rd at the end of the step.
    8. ImmSignBit: Used in load/store and branch instructions, where the immediate value is used as an offset.
    9. ConcatLookupQueryChunks: Indicates whether the instruction performs a concat-type lookup.
    10. Virtual: 1 if the instruction is "virtual", as defined in Section 6.1 of the Jolt paper.
    11. Assert: 1 if the instruction is an assert, as defined in Section 6.1.1 of the Jolt paper.
    12. DoNotUpdatePC: Used in virtual sequences; the program counter should be the same for the full sequence.
  • Instruction flags: these are the unary bits used to indicate which instruction is executed at a given step. There are as many per step as the number of unique instruction lookup tables in Jolt.

Constraint system

The constraints for a CPU step are detailed in the uniform_constraints and non_uniform_constraints functions in constraints.rs.

Reusing commitments

As with most SNARK backends, Spartan requires computing a commitment to the inputs to the constraint system. A catch (and an optimization feature) in Jolt is that most of the inputs are also used as inputs to proofs in the other modules. For example, the address and values pertaining to the bytecode are used in the bytecode memory-checking proof, and the lookup chunks, output and flags are used in the instruction lookup proof. For Jolt to be sound, it must be ensured that the same inputs are fed to all relevant proofs. We do this by re-using the commitments themselves. Spartan is adapted to take pre-committed witness variables.

Exploiting uniformity

The uniformity of the constraint system allows us to heavily optimize both the prover and verifier. The main changes involved in making this happen are:

  • Spartan is modified to only take in the constraint matrices of a single step, along with the total number of steps. Using this, the prover and verifier can efficiently calculate the multilinear extensions of the full R1CS matrices.
  • The commitment format of the witness values is changed to reflect uniformity. All versions of a variable corresponding to each time step are committed together. This affects nearly all variables committed to in Jolt.
  • The inputs and witnesses are provided to the constraint system as segments.
  • Additional constraints are used to enforce consistency of the state transferred between CPU steps.

These changes and their impact on the code are visible in spartan.rs.

M extension

Jolt supports the RISC-V "M" extension for integer multiplication and division. The instructions included in this extension are described here. For RV32, the M extension includes 8 instructions: MUL, MULH, MULHSU, MULHU, DIV, DIVU, REM, and REMU.

The Jolt paper describes how to handle the M extension instructions in Section 6, but our implementation deviates from the paper in a couple ways (described below).

Virtual sequences

Section 6.1 of the Jolt paper introduces virtual instructions and registers –– some of the M extension instructions cannot be implemented as a single subtable decomposition, but rather must be split into a sequence of instructions which together compute the output and place it in the destination register. In our implementation, these sequences are captured by the VirtualInstructionSequence trait.

The instructions that comprise such a sequence can be a combination of "real" RISC-V instructions and "virtual" instructions which only appear in the context of virtual sequences. We also introduce 32 virtual registers as "scratch space" where instructions in a virtual sequence can write intermediate values.

Deviations from the Jolt paper

There are three inconsistencies between the virtual sequences provided in Section 6.3 of the Jolt paper, and the RISC-V specification. Namely:

  1. The Jolt prover (as described in the paper) would fail to produce a valid proof if it encountered a division by zero; since the divisor y is 0, the ASSERT_LTU/ASSERT_LT_ABS would always fail (for DIVU and DIV, respectively).
  2. The MLE provided for ASSERT_LT_ABS in Section 6.1.1 doesn't account for two's complement.
  3. The ASSERT_EQ_SIGNS instruction should always return true if the remainder is 0.

To address these issues, our implementation of DIVU, DIV, REMU, and REM deviate from the Jolt paper in the following ways.

DIVU virtual sequence

  1. ADVICE --, --, --, // store non-deterministic advice into
  2. ADVICE --, --, --, // store non-deterministic advice into
  3. MUL , , --, // compute q * y
  4. ASSERT_VALID_UNSIGNED_REMAINDER , , --, -- // assert that y == 0 || r < y
  5. ASSERT_LTE , , --, -- // assert q * y <= x
  6. ASSERT_VALID_DIV0 , , --, -- // assert that y != 0 || q == 2 ** WORD_SIZE - 1
  7. ADD , , --, // compute q * y + r
  8. ASSERT_EQ , , --, --
  9. MOVE , --, --, rd

REMU virtual sequence

  1. ADVICE --, --, --, // store non-deterministic advice into
  2. ADVICE --, --, --, // store non-deterministic advice into
  3. MUL , , --, // compute q * y
  4. ASSERT_VALID_UNSIGNED_REMAINDER , , --, -- // assert that y == 0 || r < y
  5. ASSERT_LTE , , --, -- // assert q * y <= x
  6. ADD , , --, // compute q * y + r
  7. ASSERT_EQ , , --, --
  8. MOVE , --, --, rd

DIV virtual sequence

  1. ADVICE --, --, --, // store non-deterministic advice into
  2. ADVICE --, --, --, // store non-deterministic advice into
  3. ASSERT_VALID_SIGNED_REMAINDER , , --, -- // assert that r == 0 || y == 0 || (|r| < |y| && sign(r) == sign(y))
  4. ASSERT_VALID_DIV0 , , --, -- // assert that y != 0 || q == 2 ** WORD_SIZE - 1
  5. MUL , , --, // compute q * y
  6. ADD , , --, // compute q * y + r
  7. ASSERT_EQ , , --, --
  8. MOVE , --, --, rd

REM virtual sequence

  1. ADVICE --, --, --, // store non-deterministic advice into
  2. ADVICE --, --, --, // store non-deterministic advice into
  3. ASSERT_VALID_SIGNED_REMAINDER , , --, -- // assert that r == 0 || y == 0 || (|r| < |y| && sign(r) == sign(y))
  4. MUL , , --, // compute q * y
  5. ADD , , --, // compute q * y + r
  6. ASSERT_EQ , , --, --
  7. MOVE , --, --, rd

R1CS constraints

Circuit flags

With the M extension we introduce the following circuit flags:

  1. is_virtual: Is this instruction part of a virtual sequence?
  2. is_assert: Is this instruction an ASSERT_* instruction?
  3. do_not_update_pc: If this instruction is virtual and not the last one in its sequence, then we should not update the PC. This is because all instructions in virtual sequences are mapped to the same ELF address.

Uniform constraints

The following constraints are enforced for every step of the execution trace:

  1. If the instruction is a MUL, MULU, or MULHU, the lookup query is the product of the two operands x * y (field multiplication of two 32-bit values).
  2. If the instruction is a MOV or MOVSIGN, the lookup query is a single operand x (read from the first source register rs1).
  3. If the instruction is an assert, the lookup output must be true.

Program counter constraints

Each instruction in the preprocessed bytecode contains its (compressed) memory address as given by the ELF file. This is used to compute the expected program counter for each step in the program trace.

If the do_not_update_pc flag is set, we constrain the next PC value to be equal to the current one. This handles the fact that all instructions in virtual sequences are mapped to the same ELF address.

This also means we need some other mechanism to ensure that virtual sequences are executed in order and in full. If the current instruction is virtual, we can constrain the next instruction in the trace to be the next instruction in the bytecode. We observe that the virtual sequences used in the M extension don't involve jumps or branches, so this should always hold, except if we encounter a virtual instruction followed by a padding instruction. But that should never happen because an execution trace should always end with some return handling, which shouldn't involve a virtual sequence.

The design of Jolt (the built system) leans heavily on a simple but powerful technique that is not described in the Jolt paper.

Jolt uses the technique to achieve what is sometimes called "a la carte" prover costs: for each cycle of the VM, the Jolt prover only "pays" for the primitive instruction that was actually executed at that cycle, regardless of how many instructions are in the Instruction Set Architecture (in Jolt's case, RISC-V). This design will allow Jolt to be extended to richer instruction sets (which is equivalent to supporting different pre-compiles) without increases in per-cycle prover cost.

The technique is already incorporated into Jolt in the context of lookups, but it's simplest to explain in the context of pre-compiles. Here is the rough idea. Suppose a pre-compile is implemented via some constraint system (say, R1CS for concreteness). As a first attempt at supporting the pre-compile, we can include into Jolt a single, data-parallel constraint system that "assumes" that the pre-compile is executed at every single cycle that the VM ran for. There are two problems with this. First, the pre-compile will not actually be executed at every cycle, so we need a way to "turn off" all the constraints for every cycle at which the pre-compile is not executed. Second, we do not want the prover to "pay" in runtime for the constraints that are turned off.

To address both of these issues, for each cycle we have the prover commit to a binary flag indicating whether or not that pre-compile was actually executed at that cycle.

For each cycle $t$, consider the $i$'th constraint that is part of the pre-compile, say an R1CS constraint of the form (where $a_i$, $b_i$, $c_i$ select the witness variables associated with cycle $t$):

$$\langle a_i, z \rangle \cdot \langle b_i, z \rangle = \langle c_i, z \rangle$$

Here, one should think of $z$ as an execution trace for the VM, i.e., a list of everything that happened at each and every cycle of the VM. In Jolt, there are under 100 entries of $z$ per cycle (though as we add pre-compiles this number might grow).

We modify each constraint by multiplying the left-hand side by the binary flag $\text{flag}_t$, i.e.,

$$\text{flag}_t \cdot \langle a_i, z \rangle \cdot \langle b_i, z \rangle = \langle c_i, z \rangle$$

Now, for any cycle where this pre-compile is not executed, the prover can simply assign any variables in $z$ that are only "used" by the pre-compile to 0. This makes these values totally "free" to commit to when using any commitment scheme based on elliptic curves (after we switch the commitment scheme to Binius, committing to 0s will no longer be "free", but we still expect the commitment costs to be so low that it won't matter a huge amount). These variables might not satisfy the original constraint but will satisfy the modified one simply because $\text{flag}_t$ will be set to 0. We say that $\text{flag}_t$ "turned off" the constraint.

Moreover, in sum-check-based SNARKs for R1CS like Spartan, there are prover algorithms dating to CTY11 and CMT12 where the prover's runtime grows only with the number of constraints where the associated flag is not zero. So the prover effectively pays nothing at all (no commitment costs, no field work) for constraints at any cycle where the pre-compile is not actually executed. We call these algorithms "streaming/sparse sum-check proving." In a happy coincidence, these are the same algorithms we expect to use to achieve a streaming prover (i.e., to control prover space without recursion).

More precisely, the prover time in these algorithms grows (per round of sum-check) only with the number of constraints whose associated binary flag is nonzero. Once enough rounds have passed that the standard "dense" linear-time sum-check proving algorithm is no longer the bottleneck, we can "switch over" to it for the remaining rounds, even counting the constraints that are turned off. In Jolt, we expect this "switchover" to happen within the first several rounds. In the end, the amount of extra field work done by the prover owing to the sparsity will only be a small constant factor.

Jolt uses this approach within Lasso as well. Across all of the primitive RISC-V instructions, there are about 80 "subtables" that get used. Any particular primitive instruction only needs to access between 4 and 10 of these subtables. We "pretend" that every primitive instruction actually accesses all 80 of the subtables, but use binary flags to "turn off" any subtable lookups that do not actually occur (that is, we tell the grand product in the guts of the Lasso lookup argument for that subtable to "ignore" that lookup). This is why about 93% of the factors multiplied in Jolt's various grand product arguments are 1. The same algorithmic approach ensures that the "turned off" lookups do not increase commitment time, and that our grand product prover does not pay any field work for the "turned off" subtable lookups.

There are alternative approaches we could take to achieve "a la carte" prover costs, e.g., vRAM's approach of having the prover sort all cycles by which primitive operation or pre-compile was executed at that cycle (see also the much more recent work Ceno). But the above approach is compatible with a streaming prover, avoids committing to the same data multiple times, and has other benefits.

We call this technique (fast proving for) "sparse constraint systems". Note that the term sparse here does not refer to sparsity of the R1CS constraint matrices themselves, but rather to almost all of the left-hand sides of the constraints being $0$ when the constraints are evaluated on the valid witness vector $z$.

Background

If you're interested in learning the nitty gritty details of SNARKs including questions like 'why', 'is it sound', you should read Proofs and Args ZK by Thaler. If you're simply interested in 'how', this section is for you.

The Sum-check Protocol

Adapted from the Thaler13 exposition. For a detailed introduction to sumcheck see Chapter 4.1 of the textbook.

Suppose we are given a $v$-variate polynomial $g$ defined over a finite field $\mathbb{F}$. The purpose of the sumcheck protocol is to compute the sum:

$$H := \sum_{b_1 \in \{0,1\}} \sum_{b_2 \in \{0,1\}} \cdots \sum_{b_v \in \{0,1\}} g(b_1, \ldots, b_v)$$

In order to execute the protocol, the verifier needs to be able to evaluate $g(r_1, \ldots, r_v)$ for a randomly chosen vector $(r_1, \ldots, r_v) \in \mathbb{F}^v$. Hence, from the verifier's perspective, the sum-check protocol reduces the task of summing $g$'s evaluations over $2^v$ inputs (namely, all inputs in $\{0,1\}^v$) to the task of evaluating $g$ at a single input in $\mathbb{F}^v$.

The protocol proceeds in $v$ rounds as follows. In the first round, the prover sends a polynomial $g_1(X_1)$, and claims that

$$g_1(X_1) = \sum_{(x_2, \ldots, x_v) \in \{0,1\}^{v-1}} g(X_1, x_2, \ldots, x_v)$$

Observe that if $g_1$ is as claimed, then $H = g_1(0) + g_1(1)$. Also observe that the polynomial $g_1(X_1)$ has degree $\deg_1(g)$, the degree of variable $x_1$ in $g$. Hence $g_1$ can be specified with $\deg_1(g) + 1$ field elements. In our implementation, $\mathcal{P}$ will specify $g_1$ by sending the evaluation of $g_1$ at each point in the set $\{0, 1, \ldots, \deg_1(g)\}$. (Actually, the prover does not need to send $g_1(1)$, since the verifier can infer that $g_1(1) = H - g_1(0)$; if this were not the case, the verifier would reject.)

Then, in round $j > 1$, $\mathcal{V}$ chooses a value $r_{j-1}$ uniformly at random from $\mathbb{F}$ and sends it to $\mathcal{P}$. We will often refer to this step by saying that variable $j-1$ gets bound to value $r_{j-1}$. In return, the prover sends a polynomial $g_j(X_j)$, and claims that

$$g_j(X_j) = \sum_{(x_{j+1}, \ldots, x_v) \in \{0,1\}^{v-j}} g(r_1, \ldots, r_{j-1}, X_j, x_{j+1}, \ldots, x_v)$$

The verifier compares the two most recent polynomials by checking that $g_{j-1}(r_{j-1}) = g_j(0) + g_j(1)$, and rejecting otherwise. The verifier also rejects if the degree of $g_j$ is too high: each $g_j$ should have degree $\deg_j(g)$, the degree of variable $x_j$ in $g$.

In the final round, the prover has sent $g_v(X_v)$, which is claimed to be $g(r_1, \ldots, r_{v-1}, X_v)$. $\mathcal{V}$ now checks that $g_v(r_v) = g(r_1, \ldots, r_v)$ (recall that we assumed $\mathcal{V}$ can evaluate $g$ at this point). If this test succeeds, and so do all previous tests, then the verifier accepts, and is convinced that $H = g_1(0) + g_1(1)$.

Multilinear Extensions

For any $v$-variate polynomial $g$, its multilinear extension $\widetilde{g}$ is the unique multilinear polynomial which agrees with $g$ over all points $x \in \{0,1\}^v$. By the Schwartz-Zippel lemma, if two functions $g$ and $h$ disagree at even a single input in $\{0,1\}^v$, then their multilinear extensions $\widetilde{g}$ and $\widetilde{h}$ must disagree almost everywhere over $\mathbb{F}^v$.

For more precise details please read Section 3.5 of Proofs and Args ZK.

Engineering

In practice, MLEs are stored as the vector of their evaluations over the $v$-variate boolean hypercube $\{0,1\}^v$. There are two important algorithms over multilinear extensions: single-variable binding, and evaluation.

Single Variable Binding

With a single streaming pass over all $2^v$ evaluations we can "bind" a single variable of the $v$-variate multilinear extension to a point $r$. This is a critical sub-algorithm in sumcheck. During the binding, the number of evaluation points used to represent the MLE gets reduced by a factor of 2:

  • Before: $2^v$ evaluations representing $\widetilde{f}(x_1, x_2, \dots, x_v)$ over $\{0,1\}^v$
  • After: $2^{v-1}$ evaluations representing $\widetilde{f}(r, x_2, \dots, x_v)$ over $\{0,1\}^{v-1}$

Assuming the MLE is represented as a 1-D vector $E$ of its $2^v$ evaluations over the $v$-variate boolean hypercube $\{0,1\}^v$, indexed little-endian, we can transform $E$ from representing $\widetilde{f}(x_1, x_2, \dots, x_v)$ to representing $\widetilde{f}(r, x_2, \dots, x_v)$ by "binding" the evaluations vector to the point $r$:

let n = 1 << v; // n = 2^v evaluations stored in E
let half = n / 2;
for i in 0..half {
    let low = E[i];
    let high = E[half + i];
    E[i] = low + r * (high - low);
}
// the first `half` entries of E now represent the bound (v-1)-variate MLE

Multi Variable Binding

Another common algorithm is to take the MLE $\widetilde{f}$ and compute its evaluation at a single $v$-variate point $r$ outside the boolean hypercube $\{0,1\}^v$. This can be done in $O(2^v)$ time by performing the single-variable binding algorithm $v$ times. The time spent binding the $i$'th variable is proportional to $2^{v-i}$, so the total time across all bindings is proportional to $2^v + 2^{v-1} + \cdots + 1 = O(2^v)$.
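
A sketch of this evaluation algorithm, binding one variable per pass (f64 stands in for field elements):

// Evaluate an MLE, given its 2^v evaluations over the boolean hypercube,
// at an arbitrary point r = (r_1, ..., r_v), by repeated single-variable binding.
fn evaluate_mle(mut evals: Vec<f64>, r: &[f64]) -> f64 {
    for &r_i in r {
        let half = evals.len() / 2;
        for i in 0..half {
            let (low, high) = (evals[i], evals[half + i]);
            evals[i] = low + r_i * (high - low);
        }
        evals.truncate(half);
    }
    evals[0]
}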

Eq Extension

The equality multilinear extension $\widetilde{eq}(r, x)$ is useful throughout the protocol. It is the MLE of the function $eq$ defined as follows:

$$eq(r, x) = \begin{cases} 1 & \text{if } r = x \\ 0 & \text{otherwise} \end{cases} \qquad r, x \in \{0,1\}^v$$

$\widetilde{eq}$ has the following explicit formula:

$$\widetilde{eq}(r, x) = \prod_{i=1}^{v} \left( r_i x_i + (1 - r_i)(1 - x_i) \right)$$
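
A sketch of evaluating this formula directly (f64 stands in for field elements):

// eq~(r, x) = prod_i ( r_i * x_i + (1 - r_i) * (1 - x_i) )
fn eq_eval(r: &[f64], x: &[f64]) -> f64 {
    r.iter()
        .zip(x.iter())
        .map(|(r_i, x_i)| r_i * x_i + (1.0 - r_i) * (1.0 - x_i))
        .product()
}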

Batched Openings

At the end of the sumcheck protocol, the verifier must evaluate many multilinear polynomials at a single point. For polynomials which the verifier cannot evaluate on its own, it relies on a PCS opening proof provided by the prover. To save on verifier costs, all polynomials opened at the same point can be combined into a single opening proof.

The best reading on the subject can be found in Section 16.1 of the Textbook.

Offline Memory Checking

Offline memory checking is a method that enables a prover to demonstrate to a verifier that a read/write memory was used correctly. In such a memory system, values can be written to addresses and subsequently retrieved. This technique allows the verifier to efficiently confirm that the prover adhered to the memory's rules (i.e., that the value returned by any read operation is indeed the most recent value that was written to that memory cell).

Jolt utilizes offline memory checking in the Bytecode prover, Lookup prover (for VM instruction execution), and RAM prover. The RAM prover must support a read-write memory. The Bytecode prover and Lookup prover need only support read-only memories. In the case of the Bytecode prover, the memory is initialized to contain the Bytecode of the RISC-V program, and the memory is never modified (it's read-only). And the lookup tables used for VM instruction execution are determined entirely by the RISC-V instruction set.

(The term "offline memory checking" refers to techniques that check the correctness of all read operations "all at once", after the reads have all occurred--or in SNARK settings, after the purported values returned by the reads have been committed. Off-line checking techniques do not determine as a read happens whether or not it was correct. They only ascertain, when all the reads are checked at once, whether or not all of the reads were correct.

This is in contrast to "online memory checking" techniques like Merkle hashing that immediately confirm that a memory read was done correctly by insisting that each read includes an authentication path. Merkle hashing is much more expensive on a per-read basis for SNARK provers, and offline memory checking suffices for SNARK design. This is why Lasso and Jolt use offline memory checking techniques rather than online).

Initialization Algorithm

TODO:

  • Initialize four timestamp counters
  • Implicitly assume a read operation before each write

Multiset Check

Define $I$ and $W$ as the "init" and "write" subsets, and $R$ and $F$ as the "read" and "final" subsets:

$$I = \{(a, v_a^{\text{init}}, 0) : a \in [M]\} \qquad W = \{(a_i, v_i^{\text{write}}, t_i^{\text{write}}) : i \in [m]\}$$

$$R = \{(a_i, v_i^{\text{read}}, t_i^{\text{read}}) : i \in [m]\} \qquad F = \{(a, v_a^{\text{final}}, t_a^{\text{final}}) : a \in [M]\}$$

Here, $a$, $v$, and $t$ represent the address, value, and timestamp respectively, with $m$ being the total number of memory operations and $M$ the size of the RAM.

The verifier checks that the combination of $I$ and $W$ matches the combination of $R$ and $F$, disregarding the sequence of elements; this is known as a permutation check:

$$I \cup W = R \cup F \quad \text{(as multisets)}$$

Jolt conducts this check using a homomorphic hash function applied to each set:

$$H_{\gamma, \tau}(S) = \prod_{(a, v, t) \in S} h_{\gamma, \tau}(a, v, t)$$

The hash function is defined as:

$$h_{\gamma, \tau}(a, v, t) = a \cdot \gamma^2 + v \cdot \gamma + t - \tau$$

This multiset hashing process is represented by a binary tree of multiplication gates and is computed using an optimized GKR protocol.
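
A sketch of the multiset check as a product of fingerprints, with u128 wrapping arithmetic standing in for field arithmetic and the fingerprint left abstract (the actual check uses the homomorphic hash above, with the grand products proven via GKR):

fn multiset_check(
    fingerprint: impl Fn(u64, u64, u64) -> u128, // h_{gamma,tau}(a, v, t)
    init: &[(u64, u64, u64)],
    writes: &[(u64, u64, u64)],
    reads: &[(u64, u64, u64)],
    finals: &[(u64, u64, u64)],
) -> bool {
    let prod = |s: &[(u64, u64, u64)]| {
        s.iter()
            .map(|&(a, v, t)| fingerprint(a, v, t))
            .fold(1u128, |acc, h| acc.wrapping_mul(h))
    };
    // I ∪ W and R ∪ F hash to the same value iff they are equal as multisets
    // (with overwhelming probability over the choice of gamma and tau).
    prod(init).wrapping_mul(prod(writes)) == prod(reads).wrapping_mul(prod(finals))
}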

References

GKR

GKR is a sumcheck-based interactive proof protocol for small-depth circuits of multiplication and addition gates. The standard form allows combinations of both. The protocol involves three MLEs: $\widetilde{V}_i$ is the MLE of a function capturing the values of the gates at layer $i$, and $\widetilde{add}_i$ and $\widetilde{mult}_i$ are MLEs of functions capturing the circuit's wires.

Specifically, $\widetilde{V}_i(j)$ evaluates to the value of the circuit at the $i$-th layer in the $j$-th gate. For example $\widetilde{V}_0(0)$ corresponds to the output gate.

$\widetilde{add}_i(j, a, b)$ evaluates to 1 if the $j$-th gate of the $i$-th layer is an addition gate whose inputs are the $a$'th and $b$'th gates at the preceding layer.

$\widetilde{mult}_i(j, a, b)$ evaluates to 1 if the $j$-th gate of the $i$-th layer is a multiplication gate whose inputs are the $a$'th and $b$'th gates at the preceding layer.

The sumcheck protocol is applied to the following relation between adjacent layers:

$$\widetilde{V}_i(z) = \sum_{a, b} f_{i,z}(a, b)$$

where

$$f_{i,z}(a, b) = \widetilde{add}_{i+1}(z, a, b) \left( \widetilde{V}_{i+1}(a) + \widetilde{V}_{i+1}(b) \right) + \widetilde{mult}_{i+1}(z, a, b) \cdot \widetilde{V}_{i+1}(a) \cdot \widetilde{V}_{i+1}(b)$$

(See the eq-polynomial page for details on the $\widetilde{eq}$ polynomial used below.)

Lasso and Jolt currently use the variant of GKR described in Thaler13, which is optimized for the far simpler case of a binary tree of multiplication gates. This simplifies each sumcheck to:

$$\widetilde{V}_i(z) = \sum_{p} g_{i,z}(p)$$

where

$$g_{i,z}(p) = \widetilde{eq}(z, p) \cdot \widetilde{V}_{i+1}(p, 0) \cdot \widetilde{V}_{i+1}(p, 1)$$

GKR is utilized in memory-checking for the multi-set permutation check.

Binius

Binius was written by Ben Diamond and Jim Posen of Irreducible (fka Ulvetanna). It's a new commitment scheme that allows Jolt to use smaller fields more efficiently.

The Binius paper also gives a number of sum-check-based polynomial IOPs to be combined with the commitment scheme. For Jolt's purposes, these will yield extremely efficient protocols for proving hash evaluations with various standard hash functions like Keccak (important for recursing Jolt-with-Binius-commitment), and for the RISC-V addition and multiplication operations.

Here is a brief summary of changes that will occur in Jolt in order to incorporate the Binius commitment scheme:

  • Addition and multiplication instructions will be handled via gadgets/constraint-systems instead of lookups. We may want to handle XOR and LT via gadgets too for performance reasons.

  • Booleanity constraints in the R1CS can be omitted, as booleanity is guaranteed by the binding properties of the Binius commitment scheme.

  • In offline memory checking (both Lasso for lookups, a.k.a. read-only memory, and Spice for read-write memory), access counts need to be incremented. The Binius paper (Section 4.4) proposes to do this by sticking counts in the exponent of a generator of an appropriate multiplicative subgroup of the field, but it seems more efficient to plan to do the count increments via the addition gadget.

    For Spice, the count written during a write operation is set to the maximum of a global counter and the count returned by the preceding read operation, plus 1. If the prover is honest, this written count will always equal the global counter plus 1. This max can be implemented by having the prover commit to the binary representation of the difference between the global counter and the returned count, and using the addition gadget to confirm that this difference added to the returned count indeed gives the global counter.

  • Constraints that involve addition by IMM, the immediate value, will need to implement addition via the gadget.

  • In the "collation" sum-check of Lasso that collates the results of sub-table lookups into the result of the original big-table lookup, the algebraic implementation of "concatenate the results into one field element" changes. Over a large prime-order field, concatenating two $b$-bit values $x$ and $y$ corresponds to $x + 2^b \cdot y$. Over a binary field constructed via towering, the multiplier $2^b$ gets replaced with an appropriate element of the tower basis. The same change affects various R1CS constraints ensuring that certain committed values are a limb decomposition of another committed value. (A toy prime-field sketch appears after this list.)

  • Many performance optimizations become available within the various invocations of the sum-check protocol, leveraging that the field has small characteristic and that most values being summed lie in a small subfield.
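As referenced in the collation bullet above, here is the prime-field version of "concatenate two chunks into one field element" as a tiny sketch (illustrative only; over a binary tower field the power of two below is replaced by a tower-basis element):

// Over a large prime-order field, concatenating a low b-bit chunk x with a
// high b-bit chunk y is the field operation x + 2^b * y. Toy example with b = 8.
fn collate(x: u64, y: u64, b: u32) -> u64 {
    assert!(x < (1 << b) && y < (1 << b));
    x + (y << b) // x + 2^b * y; no modular reduction needed at this toy size
}

fn main() {
    let (low, high) = (0xCD, 0xAB); // two 8-bit chunks
    assert_eq!(collate(low, high, 8), 0xABCD);
    println!("collated: {:#06x}", collate(low, high, 8));
}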

Multiplicative Generator

Finding a Multiplicative Generator for $\mathbb{F}^*$

To find a generator of the multiplicative group $\mathbb{F}^*$, which includes all nonzero elements of the field $\mathbb{F}$, a probabilistic method is used. This method iteratively tests random elements to determine if they are generators of the group. The process is summarized below.

Probabilistic Method Algo

while true:
    x = sample random nonzero element of 𝔽
    is_generator = true
    for each maximal proper divisor d of |𝔽*|:
        if xᵈ == 1:
            is_generator = false
            break
    if is_generator:
        return x

This algorithm ensures that the returned element $x$ does not reside in any proper subgroup of $\mathbb{F}^*$. The condition checked within the loop, $x^d \neq 1$ for every maximal proper divisor $d$ of $|\mathbb{F}^*|$, ensures that $x$ has the maximal possible order, which is necessary (and sufficient) for it to be a generator of the group.

Background:

  • Group Structure: $\mathbb{F}^*$ is the multiplicative group of the field $\mathbb{F}$, containing all field elements except 0. For the binary fields used here, $|\mathbb{F}| = 2^k$, so the group order $|\mathbb{F}^*| = 2^k - 1$ is odd (coprime to 2).
  • Maximal Proper Divisors: These divisors are crucial: testing $x^d$ for each maximal proper divisor $d$ of $|\mathbb{F}^*|$ ensures that $x$ is not a member of any proper cyclic subgroup, which is exactly the generator property.
  • Probabilistic Success Rate: The likelihood of a random element being a generator is significant, given the structure of the group. This can be supported by the Chinese remainder theorem, which suggests that the intersection of several smaller groups (subgroups corresponding to divisors) is likely non-trivial only when all subgroup conditions are simultaneously satisfied.

Example Divisors:

  • For $n = 12$, the maximal proper divisors are $6$ and $4$.
  • For $n = 2^4 - 1 = 15$, the maximal proper divisors are $5$ and $3$.
  • For $n = 2^6 - 1 = 63$, the maximal proper divisors are $21$ and $9$.

Maximal Proper Divisors

A maximal proper divisor of a number $n$ is a divisor $d$ of $n$ with $d \neq 1$ and $d \neq n$ that is not a proper divisor of any other proper divisor of $n$; equivalently, $d = n / p$ for some prime $p$ dividing $n$. Essentially, $d$ is as large as a proper divisor can be: the only divisor of $n$ that $d$ properly divides is $n$ itself. These are exactly the exponents needed in the generator test above: an element $x$ generates the group if and only if $x^d \neq 1$ for every maximal proper divisor $d$ of the group order.

Algorithm for Finding

The algorithm to find the maximal proper divisors of a given number $n$ involves identifying all divisors of $n$ and then selecting those which do not properly divide any other proper divisor. The steps are as follows:

  1. Find All Divisors: First, list all divisors of $n$ by checking, for every integer $i$ from $1$ to $\sqrt{n}$, whether $i$ divides $n$. If $i$ divides $n$, then both $i$ and $n / i$ are divisors.
  2. Filter Maximal Divisors: From this list of divisors, exclude $1$ and $n$. A remaining divisor $d$ is a maximal proper divisor if it does not divide any other remaining divisor.
function findMaximalProperDivisors(n):
    divisors = []
    for i from 1 to sqrt(n):
        if n % i == 0:
            divisors.append(i)
            if i != n / i:
                divisors.append(n / i)
    divisors.remove(1)
    divisors.remove(n)

    maximal_proper_divisors = []
    for d in divisors:
        is_maximal = true
        for e in divisors:
            # d is not maximal if it properly divides some other proper divisor e
            if d != e and e % d == 0:
                is_maximal = false
                break
        if is_maximal:
            maximal_proper_divisors.append(d)

    return maximal_proper_divisors
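Below is a minimal, self-contained Rust sketch tying the two routines together. It is purely illustrative: it works over the multiplicative group of a small prime field rather than the binary tower fields relevant to Binius, it sweeps candidates deterministically instead of sampling at random, and the helper names and choice of prime are assumptions made for the example.

// Find the maximal proper divisors of a group order, then search for a
// generator of (Z/pZ)* using the test "x^d != 1 for every maximal proper divisor d".

fn pow_mod(base: u64, mut exp: u64, p: u64) -> u64 {
    let m = p as u128;
    let (mut acc, mut b) = (1u128, (base % p) as u128);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * b % m;
        }
        b = b * b % m;
        exp >>= 1;
    }
    acc as u64
}

// The maximal proper divisors of n are exactly n / q for each prime q dividing n.
fn maximal_proper_divisors(n: u64) -> Vec<u64> {
    let mut result = Vec::new();
    let (mut m, mut q) = (n, 2u64);
    while q * q <= m {
        if m % q == 0 {
            result.push(n / q);
            while m % q == 0 {
                m /= q;
            }
        }
        q += 1;
    }
    if m > 1 {
        result.push(n / m);
    }
    result
}

fn find_generator(p: u64) -> u64 {
    let order = p - 1; // |F*| for the prime field Z/pZ
    let tests = maximal_proper_divisors(order);
    // Deterministic sweep instead of random sampling, to keep the sketch dependency-free.
    for x in 2..p {
        if tests.iter().all(|&d| pow_mod(x, d, p) != 1) {
            return x; // x has order p - 1, so it generates the group
        }
    }
    unreachable!("the multiplicative group of a prime field is cyclic");
}

fn main() {
    let p = (1u64 << 31) - 1; // the Mersenne prime 2^31 - 1, an arbitrary small example
    let g = find_generator(p);
    println!("generator of (Z/{}Z)*: {}", p, g);
}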

RISC-V

RISC-V is an open source instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. Every RISC-V implementation must include the base integer ISA, with optional extensions available for additional functionality.

Jolt implements the base RISC-V instruction set, making it a RISC-V-compliant virtual machine. This means Jolt can execute and prove any code that compiles to RISC-V.

Supported Instruction Sets

Current ISA Configuration: RV32IM

Base Sets

RV32I

RV32I is the base 32-bit integer instruction set. It's designed to be sufficient for a complete software toolchain while being simple and minimal. Everything else in RISC-V (multiplication, floating point, atomic operations) is built as extensions on top of this base ISA.

Key properties:
  • 32-bit fixed-width instructions

  • 32 integer registers (x0-x31), where x0 is hardwired to zero. Register x1/ra is reserved for return address linkage by jump-and-link instructions, x2/sp is conventionally used as stack pointer. Each register is 32 bits wide and used for both integer and memory address computations.

  • 32-bit address space (byte-addressed and little-endian). Memory accesses can be to byte (8-bit), halfword (16-bit), or word (32-bit) sized values

  • Basic arithmetic operations (add, subtract, shift, logical operations)

  • Load/store architecture means memory can only be accessed through dedicated load and store instructions - all other instructions operate only on registers

  • Simple addressing mode of base register plus 12-bit signed immediate offset. No complex memory addressing modes or memory-to-memory operations

  • Conditional branches and jumps are supported

For detailed instruction formats and encoding, refer to Chapter 2 of the specification.

Extensions

"M" Standard Extension for Integer Multiplication and Division

Key properties:
  • Multiplication operations generate 32-bit (lower) or 64-bit (full) results

  • Separate signed and unsigned multiply instructions

  • Hardware division with both signed and unsigned variants

  • All operations work on values in integer registers

  • Divide-by-zero produces a well-defined result rather than a trap (all ones, i.e., the maximum unsigned value, for DIVU; -1 for DIV)

  • Maintains the simple register-based architecture of RV32I

  • Results are always written to a single 32-bit register (to obtain the upper and lower halves of a full multiplication result, two separate instructions are used)

  • All instructions in this extension are encoded in the standard 32-bit RISC-V format

Core Operations:
  • MUL: Multiplication, lower 32-bit result

  • MULH/MULHU/MULHSU: Upper 32-bit multiplication (signed×signed, unsigned×unsigned, signed×unsigned)

  • DIV/DIVU: Signed and unsigned division

  • REM/REMU: Signed and unsigned remainder

For detailed instruction formats and encoding, refer to Chapter 7 of the specification.
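As a plain-Rust illustration of the semantics above (not Jolt's implementation; the function names are just stand-ins for the corresponding instructions):

// Reference-style semantics for a few RV32M operations.
fn mul(a: u32, b: u32) -> u32 { a.wrapping_mul(b) }                      // MUL: lower 32 bits
fn mulh(a: i32, b: i32) -> u32 { ((a as i64 * b as i64) >> 32) as u32 }  // MULH: upper 32 bits, signed x signed
fn mulhu(a: u32, b: u32) -> u32 { ((a as u64 * b as u64) >> 32) as u32 } // MULHU: upper 32 bits, unsigned x unsigned

fn div(a: i32, b: i32) -> i32 {
    if b == 0 { -1 }                              // DIV: divide-by-zero yields -1 (all ones)
    else if a == i32::MIN && b == -1 { i32::MIN } // signed-overflow case defined by the spec
    else { a / b }
}
fn divu(a: u32, b: u32) -> u32 { if b == 0 { u32::MAX } else { a / b } } // DIVU: divide-by-zero yields 2^32 - 1
fn remu(a: u32, b: u32) -> u32 { if b == 0 { a } else { a % b } }        // REMU: remainder of divide-by-zero is the dividend

fn main() {
    assert_eq!(mul(3, 4), 12);
    assert_eq!(mulh(-1, -1), 0);          // (-1) * (-1) = 1; upper word is 0
    assert_eq!(mulhu(0xFFFF_FFFF, 2), 1); // (2^32 - 1) * 2 = 2^33 - 2; upper word is 1
    assert_eq!(div(i32::MIN, -1), i32::MIN);
    assert_eq!(divu(7, 0), u32::MAX);
    assert_eq!(remu(7, 0), 7);
    println!("RV32M semantics sketch OK");
}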

LLVM

LLVM is a versatile compiler infrastructure that supports a variety of languages and architectures. RISC-V is fully supported by the LLVM compiler infrastructure:

  • Official RISC-V backend in LLVM

  • Support for all standard extensions (M, A, F, D, C)

  • Integration with standard LLVM tools (clang, lld, lldb)

  • Support for both 32-bit and 64-bit targets

One of the key features of LLVM is its Intermediate Representation (IR). IR serves as a bridge between high-level languages and machine code. This means any language that compiles to LLVM IR (like C, C++, Rust, etc.) can be compiled to RISC-V and subsequently proven by Jolt:

Compilation to RISC-V target


Dev

For more information on setting up your development environment, see Installation Guide.

For details on the tools used in the Jolt development process, refer to Development Tools.

Development Installation

Rust

  • curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  • Rustup should automatically install Rust toolchain and necessary targets on the first cargo invocation. If you need to add the RISC-V target for building guest programs manually use rustup target add riscv32im-unknown-none-elf.

mdBook

For building this book:

cargo install mdbook mdbook-katex

For watching the changes in your local browser:

mdbook watch book --open

Development Tools

Tracing

Jolt is instrumented using tokio-rs/tracing. These traces can be displayed using the --format chrome flag, for example: cargo run -p jolt-core --release -- trace --name sha2-chain --format chrome

After tracing, files can be found in the workspace root with a name trace-<timestamp>.json. Load these traces into perfetto.

Tracing in Jolt

Often it's easiest to debug performance for a particular segment by adding granular tracing, adjusting code, rerunning the sha2-chain end-to-end benchmark and looking through the Chrome traces.
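For example, a new span can be added either with the tracing crate's instrument attribute or with an explicit span guard. The snippet below is illustrative (the function and span names are hypothetical, not from the Jolt codebase) and assumes the tracing crate is already a dependency:

use tracing::{info_span, instrument};

// Attribute style: the whole function body becomes a span named after the function.
#[instrument(skip_all)]
fn compute_grand_product(leaves: &[u64]) -> u64 {
    // Explicit child span around a hot inner section, with a recorded field.
    let _guard = info_span!("fold_layers", n = leaves.len() as u64).entered();
    leaves.iter().product()
}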

Objdump

Debugging the emulator / tracer can be hard. Use riscv64-unknown-elf-objdump to compare the actual ELF to the .bytecode / .jolttrace files.
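For instance, disassembling the guest ELF into a text file for side-by-side comparison (the file names here are illustrative):

riscv64-unknown-elf-objdump -d guest.elf > guest.objdump.txt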

All of the open tasks are tracked via GitHub Issues.

Jolt Skill Tree

Improvements complete or in progress since the initial release of Jolt in April 2024

There are many tasks described in the various files in the "roadmap/future" section of the wiki. Let's use this file as the primary means of tracking what is done or in progress. Anything not mentioned in this file is presumably not yet started (or barely started).

Functionality improvements

  • Support for stdlib

  • Support for M-extension.

  • In progress: on-chain verifier (Solidity).

Verifier cost improvements

  • Add support for HyperKZG commitment

  • Add support for Quarks/Spartan grand product argument and hybrid grand product (which achieves most of the verifier benefits of Quarks without a significant hit to prover time).

    • Note: the cost of Quarks/Spartan and hybrid grand products (i.e., the time to commit to partial products) was improved subsequent to the initial implementation, via this PR.
  • In progress: change how we are batching grand products, treating them all, laid side by side, as one giant circuit. This will reduce proof size by up to 200KB.

  • Reduce the number of polynomial evaluation proofs from 7-10 down to 1 (achieved in this PR)

Prover cost improvements (all in progress)

  • Eliminate cost of pre-computed tables of eq evaluations for each sum-check, as per Dao-Thaler.

  • (Nearly) eliminate second sum-check instance in Spartan.

  • Implement the sum-check prover optimization from Section 3 of Angus Gruen's paper.

  • Implement the sum-check prover optimizations from Bagad-Domb-Thaler, which actually apply whenever small values are being summed, even if those values reside in a big (e.g., 256-bit) field. This captures Spartan as applied in Jolt. Thanks to Lev Soukhanov for this observation.

  • Replace byte-addressable memory with word-addressable memory. This is actually implemented, but the above speedups to Spartan proving will need to be implemented before it yields an overall performance improvement.

  • AVX-512 speedups (see here for a detailed description of in-progress efforts).

  • GPU integration.

  • Prover space control via folding (see here for a sketchy overview of how this will be implemented).

  • Eliminate some unnecessary lookups identified via formal verification efforts.

Contributors to proof size

With HyperKZG commitments, Jolt's proof size (with no composition/recursion) today (or in the very near term) is 35KB-200KB.

Here is the breakdown of contributors to proof size:

  1. In Jolt, we commit to about 250 polynomials and produce just one HyperKZG evaluation proof. It's one group element (about 32 bytes) per commitment, and an evaluation proof is a couple of dozen group elements. So this is about 9 KBs total.

    With some engineering headaches we could go below 9 KBs by committing to some of the 250 polynomials as a single, larger polynomial, but we do not currently plan to implement this.

    Most of the 250 polynomials come from the "subtables" used in Lasso, as each subtable arising in Lasso brings several committed polynomials.

  2. Jolt runs six grand product arguments in parallel: two for Lasso lookups into "instruction evaluation tables", two for Lasso lookups into the RISC-V bytecode, and two for the Spice read-write-memory checking. These six grand products dominate the proof size. How big the proofs are depends on which grand product argument is used.

    With pure Thaler13/GKR, the proof size is about 200KB (~30 KB per grand product).

    With pure Quarks/Spartan, the proof size is only about 25KB. But this grand product argument is about 10x slower than Thaler13.

    With a "hybrid" between the two that avoids a significant prover slowdown, the proof size is about 100 KB.

    Eventually, with modest engineering effort, we can reduce these 6 grand products down to 3. This would reduce proof size by another factor of 2 or so. See bottom of this page for further details.

  3. Spartan for R1CS involves only two sum-checks (soon to be reduced to one), so it contributes about 7 KBs to the proof (soon to fall to 3-4 KBs).

Details on grand products

The 6 grand product arguments run in Jolt stem from the following:

  • Two for instruction lookups (Lasso).
  • Two for read/write memory (Spice).
  • Two for bytecode lookups (Lasso).

For each of the above, one grand product attests to the validity of reads and writes into the relevant memories, and one attests to the validity of the initialization of memory plus a final pass over memory. The reason we do not run these grand products "together as one big grand product" is that they are each potentially of different sizes, and it is annoying (though possible) to "batch prove" differently-sized grand products together. However, a relatively easy way to get down to 3 grand products is to set the memory size in each of the three categories above to equal the number of reads/writes. This simply involves padding the memory with zeros to make it equal in size to the number of reads/writes into the memory (i.e., NUM_CYCLES). Doing this will not substantially increase prover time.

Achieving zero-knowledge

One way to achieve zero-knowledge is to simply compose Jolt with a zero-knowledge SNARK like Groth16: that is, use Groth16 (which is ZK) to prove that one knows a Jolt proof. Composing with Groth16 or Plonk is how most zkVMs get low on-chain verification costs anyway, and it also "adds" ZK. This approach is on Jolt's roadmap, although it will take some time to complete (as it requires representing the Jolt verifier in R1CS or Plonkish, which is a pain).

A second way to achieve zero-knowledge is to combine Jolt with folding, which we will do regardless, in order to make the prover space independent of the number of RISC-V cycles being proven. As described in Section 7 of the latest version of the HyperNova paper, one can straightforwardly obtain zero-knowledge directly from folding, without composition with a zkSNARK like Groth16.

There are also ways to make Jolt zero-knowledge without invoking SNARK composition. For example, rendering sum-check-based SNARKs zero-knowledge without using composition was exactly the motivation for Zeromorph, which introduces a very efficient zero-knowledge variant of KZG commitments for multilinear polynomials.

If we use the Zeromorph polynomial commitment scheme, the commitment and any evaluation proof are hiding (they reveal nothing about the committed polynomial, and still give the verifier a commitment to the requested evaluation of the committed polynomial). One still needs to ensure that the various applications of the sum-check protocol in Jolt also do not leak any information about the witness. Here, techniques based on masking polynomials apply (see Section 13.3 of Proofs, Arguments, and Zero-Knowledge for a sketchy overview). However, the use of masking polynomials requires the prover to be able to commit to non-multilinear polynomials, and hence introduces significant (but surmountable) issues.

A similar approach to achieving ZK also applies when using a hashing-based polynomial commitment scheme like Brakedown. Roughly, to randomize all values sent by the prover during the Brakedown evaluation proof, it suffices to pad the committed polynomial with sufficiently many random values. One does need to lightly modify the Jolt polynomial IOP to "ignore" these extra, random committed values.

A final technique to render all of the sum-checks ZK without SNARK composition is given in Hyrax (based on old work of Cramer and Damgård). Roughly, rather than the prover sending field elements "in the clear", it instead sends (blinded, hence hiding) Pedersen commitments to these field elements. And the verifier exploits homomorphism properties to confirm that the committed field elements would have passed all of the sum-check verifier's checks. See Section 13.2 of Proofs, Arguments, and Zero-Knowledge for additional discussion.

Estimated costs of on-chain verifier with no proof composition

We expect proof storage and verification to cost about 2 million gas, with proofs that are a couple of dozen KBs in size. A crude and speculative explanation of where this estimate comes from is below.

Details

Costs due to processing commitments

Once we fully implement batching of polynomial evaluation proofs, Jolt will only require one HyperKZG opening. Verifying that opening costs something like (conservatively) $O(\log n)$ scalar multiplications and two pairings, where $n$ is the number of RISC-V cycles being proven.

On top of that, the verifier is going to do about 80 scalar multiplications for homomorphically combining various commitments.

Let's conservatively ballpark this at 150 scalar multiplications and two pairings in total for the verifier. Each scalar multiplication costs 6,000 gas and each pairing costs 45,000 gas. That's about 1 million gas total due to scalar multiplications and pairings.
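Spelled out:

$$150 \cdot 6{,}000 + 2 \cdot 45{,}000 = 900{,}000 + 90{,}000 = 990{,}000 \approx 1 \text{ million gas}.$$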

Costs due to sum-check verification and multilinear extension evaluations

The Jolt verifier will invoke the sum-check protocol about ten times (as little as 5 if we fully batch these sum-checks, meaning run as many as possible in parallel combined with standard random-linear-combination techniques, but let's say ten as a conservative estimate since it will take us an annoying amount of engineering effort to maximally batch the sum-checks).

One of these sum-checks comes from applying Spartan for R1CS, five or so from the hybrid grand product argument (a hybrid between Thaler13/GKR and Quarks/Spartan, which gives a nice balance between prover time and verifier costs), one from batching polynomial evaluation queries, one from an input/output-consistency check, and one from the "collation" sum-check in the Lasso lookup argument.

The total amount of data sent in these sum-checks (plus HyperKZG commitments and the evaluation proof) should be a few dozen KBs. Indeed, each sum-check should be roughly 40 rounds, with at most three field elements sent per round (except for the collation sum-check, as the polynomial involved there has degree 6 or so in each variable). That's roughly $40 \cdot 3 \cdot 32 \approx 4{,}000$ bytes of data in each invocation of sum-check, and another 7 KBs or so for HyperKZG commitments and the evaluation proof. In total, that's perhaps 40 KBs of data.

Sum-check verification is simply field work, let's say ~100 field ops total per sum-check (very crude estimate), plus associated Fiat-Shamir hashing. On top of all of that, the Jolt verifier does MLE evaluations for all the "subtables" used in Lasso.

All told, we can ballpark the total number of field operations done by the verifier at about 2,000, with Fiat-Shamir hashing of at most 40 KBs of data.

The gas costs of this hashing, plus the field work, plus storing the entire proof in CALLDATA, should be under 1 million.

Combined with the ~1 million gas to perform scalar multiplications and pairings, we obtain an estimate of between 1.3 million and 2 million gas in total.

Groth16 Recursion

Jolt's verifier with no proof composition is expensive. We estimate about 2 million gas to verify within the EVM. Further, the short-term continuations plan causes a linear blowup depending on the count of monolithic trace chunks proven.

To solve these two issues we're aiming to add a configuration option to the Jolt prover with a post-processing step, which creates a Groth16 proof of the Jolt verifier for constant proof size / cost (~280k gas on EVM) regardless of continuation chunk count or opening proof cost. This technique is industry standard. (An added benefit of this proof composition is that it will render Jolt proofs zero-knowledge.)

We call directly representing the Jolt verifier (with the HyperKZG polynomial commitment scheme) as constraints and then feeding those constraints into Groth16 "naive composition". Unfortunately, this naive procedure will result in over a hundred million constraints. Applying Groth16 to such a large constraint system will result in far more latency than we'd like (and may even be impossible over the BN254 scalar field, because that field only supports FFTs of length up to $2^{28}$). Below, we describe alternate ways forward.

Cost accounting for naive composition

In naive composition, the bulk of the constraints are devoted to implementing the ~150 scalar multiplications done by the Jolt verifier (or more accurately, ~150 scalar multiplications will be the dominant verifier cost once we finish all optimizations that are now in progress).

Each scalar multiplication costs about 400 group operations, each of which costs about 10 field multiplications, so that's about 600,000 field multiplications. The real killer is that these field multiplications must be done non-natively in constraints, because BN254 does not have a pairing-friendly "sister curve" (i.e., a curve whose scalar field matches the BN254 base field). This means that each of the field multiplications costs thousands of constraints.
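In rough numbers:

$$150 \text{ scalar mults} \times 400 \text{ group ops} \times 10 \text{ field mults} \approx 600{,}000 \text{ non-native multiplications},$$

and at a few thousand constraints per non-native multiplication, this lands well over a hundred million constraints, consistent with the estimate above.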

On top of the above, the two pairings done by the HyperKZG verifier, implemented non-natively in constraints, should cost about 6 million constraints.

We do expect advances in representing non-native field arithmetic in constraints to bring the above numbers down some, but not nearly enough to make naive composition sensible.

A first step to reducing gas costs

A simple first approach to reducing gas costs is to leave all scalar multiplications and HyperKZG-evaluation-proof-checking done by the Jolt verifier out of the composition, doing those directly on-chain.

This would reduce gas costs to slightly over 1 million, and ensure that Groth16 is run only on a few million constraints (perhaps even less, given upcoming advances in representing non-native field arithmetic in constraints).

A way forward

Here is our plan to bring on-chain Jolt verification down to a couple hundred thousand gas.

First, represent the Jolt verifier (with HyperKZG-over-BN254 as the polynomial commitment scheme) as an R1CS instance over the Grumpkin scalar field, and apply Spartan (with Hyrax-over-Grumpkin as the polynomial commitment) to this R1CS. This R1CS will only have a few hundred thousand constraints. This is because all scalar multiplications and pairings done by the Jolt verifier are native over Grumpkin's scalar field. (The constraint system is also highly uniform, which can benefit both the Spartan prover and verifier, though it will be so small that this is probably not worth the trouble)

The field operations done by the Jolt verifier in the various invocations of the sum-check protocol need to be represented non-natively in these R1CS constraints, but since there are only about 2,000 such field operations they still only cost perhaps 5 million constraints in total. See the Testudo paper for a similar approach and associated calculations. This constraint system may be big enough to be worth leveraging its uniformity when we apply Spartan to it.

We further expect upcoming advances in methods for addressing non-native field arithmetic (and/or more careful optimizations of the Jolt verifier) to bring this down to under 2 million constraints.

But the Spartan proof is still too big to post on-chain. So, second, represent the Spartan verifier as an R1CS instance over the BN254 scalar field, and apply Groth16 to this R1CS. This is the proof posted and verified on-chain.

The Spartan verifier only does at most a couple of hundred field operations (since there are only two sum-check invocations in Spartan, and the second sum-check invocation can be especially tiny for highly uniform constraint systems) and roughly $\sqrt{n}$ scalar multiplications, where $n$ is the number of columns (i.e., witness variables) in the R1CS instance. Here $\sqrt{n}$ will be on the order of 3,000, perhaps less. Each (128-bit) scalar multiplication (which is done natively in this R1CS) costs about 2,000 constraints, so that's roughly 6 million constraints being fed to Groth16.

The calculation above is slightly delicate, because running Groth16 on 6 million constraints would be fine for many settings, but, say, 50 million constraints would not be. We do think 6 million is about the right estimate for this approach, especially given that 2,000 field operations for the Jolt verifier is a conservative estimate.

(Note that some works have dramatically overestimated the number of constraints needed to represent the Jolt-with-HyperKZG verifier as on the order of 2 billion, perhaps because they considered directly composing Jolt-with-HyperKZG with Groth16, which necessitates representing the HyperKZG verifier with non-native field arithmetic. The two-step composition proposed here avoids this overhead, leading to just 2-5 million constraints.)

Details of the Jolt-with-HyperKZG verifier

The easiest way to understand the workload of the verifier circuit is to jump through the codebase starting at vm/mod.rs Jolt::verify(...). Verification can be split into 4 logical modules: instruction lookups, read-write memory, bytecode, r1cs.

Each of the modules does some combination of the following:

After recursively verifying a sumcheck, the verifier needs to compare the claimed evaluation of the sumcheck operand at a random point $r$ to its own evaluation of that polynomial at $r$. Jolt does this with a combination of opening proofs over the constituent polynomials of the operand and direct evaluations of the multilinear extensions of those polynomials when they have sufficient structure.

Specifics

Polynomial opening proof verification

HyperKZG is currently the optimal commitment scheme for recursion due to the requirement of only two pairing operations per opening proof. Unfortunately, non-native field arithmetic will always be expensive within a circuit.

There are two options:

  • Sumcheck and MLE evaluations using native arithmetic, pairing operations using non-native arithmetic
  • Sumcheck and MLE evaluations using non-native arithmetic, pairing operations using native arithmetic

We believe the latter is more efficient, albeit unergonomic. Some of the details are worked out in this paper.

Polynomial opening proof batching

Jolt requires tens of opening proofs across all constituent polynomials in all sumchecks. If we did these independently the verifier would be prohibitively expensive. Instead, we batch all opening proofs for polynomials which share an evaluation point $r$.

verify_instruction_lookups

The instruction lookups module does two sumchecks, described in more detail here. The first contains some complexity. The evaluation of the MLE of each instruction at the point spit out by sumcheck is computed directly by the verifier. The verifier is able to do this thanks to the property from Lasso that each table is SOS (decomposable).

The LassoSubtable trait is implemented for all subtables. LassoSubtable::evaluate_mle(r) computes the MLE of each subtable. The JoltInstruction trait combines a series of underlying subtables. The MLEs of these subtables are combined to an instruction MLE via JoltInstruction::combine_lookups(vals: &[F]). Finally each of the instruction MLEs are combined into a VM-wide lookup MLE via InstructionLookupsProof::combine_lookups(...).

The Groth16 verifier circuit would have to mimic this pattern, implementing the MLE evaluation logic for each of the subtables, the combination logic for each of the instructions, and the combination logic to aggregate all instructions. It's possible that subtables / instructions will be added or removed in the future.
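Schematically, the pattern the circuit must mirror looks like the following sketch. The trait shapes and names here are simplified stand-ins (assumptions for illustration), not the actual LassoSubtable / JoltInstruction definitions from the codebase:

// Schematic only: simplified stand-ins for the real traits in jolt-core.
trait Subtable<F> {
    /// Evaluate this subtable's multilinear extension at the point r.
    fn evaluate_mle(&self, r: &[F]) -> F;
}

trait Instruction<F> {
    /// The subtables this instruction is built from.
    fn subtables(&self) -> Vec<Box<dyn Subtable<F>>>;
    /// Collate the subtable evaluations into the instruction's lookup value.
    fn combine_lookups(&self, vals: &[F]) -> F;
}

/// What the verifier (and hence the recursion circuit) recomputes for one instruction:
/// evaluate every constituent subtable MLE at r, then collate the results.
fn instruction_mle_at<F>(instr: &dyn Instruction<F>, r: &[F]) -> F {
    let vals: Vec<F> = instr.subtables().iter().map(|s| s.evaluate_mle(r)).collect();
    instr.combine_lookups(&vals)
}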

verify_r1cs

R1CS is proven via a modified Spartan instance which runs two sumchecks and a single opening proof.

There are two difficult MLEs to evaluate:

  • $\widetilde{A}(r_x, r_y)$, $\widetilde{B}(r_x, r_y)$, $\widetilde{C}(r_x, r_y)$ – evaluations of the MLEs of the R1CS coefficient matrices
  • $\widetilde{z}(r_y)$ – evaluation of the MLE of the witness vector

The sections below are under-described in the wiki. We'll flesh these out shortly. Assume this step comes last.

For the coefficient-matrix evaluations, we must leverage the uniformity of the constraint system to evaluate efficiently. This is under-described in the wiki, but we'll get to it ASAP.

The witness vector $z$ is composed of all of the inputs to the R1CS circuit concatenated together in trace_length-sized chunks. All of these are committed independently and are checked via a batched opening proof.

Engineering Suggestions

The Jolt codebase is rapidly undergoing improvements to reduce prover and verifier costs as well as simplify abstractions. As a result, it's recommended that each section above be built in modules that are convenient to rewire. Each part should be incrementally testable and adjustable.

A concrete example of changes to expect: we currently require 5-10 opening proofs per Jolt execution. Even for HyperKZG this requires 10-20 pairings which is prohibitively expensive. We are working on fixing this via better batching.

Suggested plan of attack:

  1. Circuit for HyperKZG verification
  2. Circuit for batched HyperKZG verification
  3. Circuit for sumcheck verification
  4. Circuit for instruction/subtable MLE evaluations
  5. Circuit for Spartan verification

Precompiles

Precompiles are highly optimized SNARK gadgets which can be invoked from the high-level programming language of the VM user. These gadgets can be much more efficient for the prover than compiling down to the underlying ISA by exploiting the structure of the workload. In practice zkVMs use these for heavy cryptographic operations such as hash functions, signatures and other elliptic curve arithmetic.

By popular demand, Jolt will support these gadgets as well. The short term plan is to optimize for minimizing Jolt-core development resources rather than optimal prover speed.

Precompile support plan:

  1. RV32 library wrapping syscalls of supported libraries
  2. Tracer picks up syscalls, sets relevant flag bits and loads memory accordingly
  3. Individual (uniform) Spartan instance for each precompile, repeated over trace_length steps
  4. Jolt config includes which precompiles are supported (there is some non-zero prover / verifier cost to including an unused precompile)
  5. Survey existing hash / elliptic curve arithmetic R1CS arithmetizations. Prioritize efficiency and audits.
  6. Use circom-scotia to convert matrices into static files in the Jolt codebase
  7. Write a converter to uniformly repeat the constraints trace_length steps

See the documentation on Jolt's use/handling of sparse constraint systems for a detailed overview of how the Jolt proof machinery will incorporate these pre-compiled constraint systems.

TODO(sragss): How do we deal with memory and loading more than 64-bits of inputs to precompiles.

Prover space control

There is a major impediment to Jolt’s usefulness today: it consumes large amounts of memory, which limits the length of RISC-V executions that it can prove. Proving about 16 million RISC-V cycles consumes hundreds of GBs of space.

There are two known techniques for reducing the prover space in zkVMs. One is called “continuations” and inherently involves proof recursion. This is the approach taken by almost all zkVM projects today. The other technique is non-recursive: it is well-known that sum-check-based SNARKs in particular do not require recursion to keep the prover's space under control. Nonetheless, very few projects are pursuing this non-recursive approach (the only one we are aware of is Ligetron, which is not sum-check-based). We call these non-recursive techniques "the streaming prover".

Jolt will pursue both approaches to prover space control. Below, we provide more details on the two approaches.

More detail on continuations

Continuations work by breaking a large computation into “chunks”, proving each chunk (almost) independently, and recursively aggregating the proofs (i.e., proving one knows the proofs for each chunk).

Continuations come in two flavors: "brute-force recursion" and folding. In brute-force recursion, the proofs for different chunks are aggregated by having the prover prove it knows each proof. Roughly speaking, the verifier of each proof is represented as a circuit, and a SNARK is used to prove that the prover knows a satisfying assignment for each circuit.

In folding schemes, the "proof" for each chunk is actually only a "partial proof", in particular omitting any evaluation proofs for any committed polynomials. The partial proofs for each chunk are not explicitly checked by anyone. Rather, they are "aggregated" into a single partial proof, and that partial proof is then "completed" into a full SNARK proof. In folding schemes, the prover winds up recursively proving that it correctly aggregated the partial proofs. This has major performance benefits relative to "brute force recursion", because aggregating proofs is much simpler than checking them. Hence, proving aggregation was done correctly is much cheaper than proving full-fledged proof verification was done correctly.

Folding schemes only work with elliptic-curve-based commitment schemes (there have been some efforts to extend folding techniques to hashing-based commitment schemes but they introduce large overheads). So most zkVMs do not pursue folding today as they use hashing-based commitment schemes. Fortunately, Jolt today does use elliptic curves and hence folding as a means of prover space control is available to it. A sketchy overview of how to integrate folding into Jolt is provided here.

More detail on non-recursive space control

Brute-force recursion has the following downsides.

First, recursive proof aggregation imposes a lower bound of several GBs on prover space, because this is how much space is required to represent a SNARK verifier as a circuit and store that circuit. Second, the chunks are not completely independent, because one has to make sure that the state of the VM's memory at the end of chunk $i$ equals the state of the VM's memory at the start of chunk $i+1$. Ensuring this is an engineering headache and adds some performance overhead. Third, recursion actually changes the statement being verified. No longer is the statement "I correctly ran the computer program on a witness and it produced a claimed output $y$". Rather it is: "I know several proofs, one per chunk, each attesting that the relevant chunk was run correctly, and with the final chunk outputting $y$." Consequently, once proof recursion is used, it can actually be quite challenging for the public to tell that the right statement is being verified.

Accordingly, we expect the non-recursive approach to space control to lead to a prover with smaller space than the continuations approach, and to a much simpler system overall.

The non-recursive approach will also place the security of the SNARK on firmer footing: security of recursive SNARKs is heuristic, while non-recursive SNARKs like Jolt itself have unconditional security in idealized models such as the Random Oracle model (perhaps with additional assumptions such as hardness of discrete logarithms, depending on the choice of commitment scheme to use in Jolt).

Crucial to non-recursive space control is that Lasso, the lookup argument used in Jolt, does not require any sorting.

Planned partial steps toward continuations

As discussed above, today Jolt is a monolithic SNARK. RISC-V traces cannot be broken up, they must be proven monolithically or not at all. As a result, Jolt has a fixed maximum trace length that can be proved which is a function of the available RAM on the prover machine. Long term we'd like to solve this by implementing a streaming version of Jolt's prover such that prover RAM usage is tunable with minimal performance loss.

Short term we're going to solve this via a very primitive form of continuations we call monolithic chunking. The plan: take a trace of length $N$, split it into $M$ chunks of size $N/M$, and prove each chunk independently. $M$ is a function of the max RAM available to the prover.

For the direct on-chain verifier there will be a verification cost linear in $M$. We believe this short-term trade-off is worthwhile for usability until we can implement the (more complicated) streaming algorithms.

Specifics

A generic config parameter will be added to the Jolt struct called ContinuationConfig. At the highest level, before calling Jolt::prove the trace will be split into M chunks. Jolt::prove will be called on each and return RAM_final which can be fed into RAM_init during the next iteration of Jolt::prove. The output zerocheck will only be run for the final chunk.
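Roughly, the chunked proving loop will have the following shape (a sketch under assumed names: prove_chunk, RamState, and ChunkProof are illustrative stand-ins, not the actual Jolt API):

// Illustrative shape of monolithic chunking: prove fixed-size chunks of the trace,
// threading the final RAM state of chunk i into the initial RAM state of chunk i + 1,
// and running the output check only on the final chunk.
struct RamState(Vec<u8>);
struct ChunkProof;

fn prove_chunk(_trace: &[u64], _ram_init: &RamState, _check_outputs: bool) -> (ChunkProof, RamState) {
    unimplemented!("stand-in for Jolt::prove on a single chunk")
}

fn prove_in_chunks(trace: &[u64], chunk_size: usize, ram0: RamState) -> (Vec<ChunkProof>, RamState) {
    let chunks: Vec<&[u64]> = trace.chunks(chunk_size).collect();
    let mut ram = ram0;
    let mut proofs = Vec::new();
    for (i, chunk) in chunks.iter().enumerate() {
        let is_last = i == chunks.len() - 1; // output check only for the final chunk
        let (proof, ram_final) = prove_chunk(chunk, &ram, is_last);
        proofs.push(proof);
        ram = ram_final; // RAM_final of this chunk becomes RAM_init of the next
    }
    (proofs, ram)
}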

References on non-recursive prover space control

  1. Verifiable Computation Using Multiple Provers. Andrew J. Blumberg, Justin Thaler, Victor Vu, and Michael Walfish. https://eprint.iacr.org/2014/846.
  2. Public-Coin Zero-Knowledge Arguments with (Almost) Minimal Time and Space Overheads. Alexander R. Block, Justin Holmgren, Alon Rosen, Ron D. Rothblum, and Pratik Soni. Theory of Cryptography Conference (TCC) 2020, pp. 168-197.
  3. Gemini: Elastic SNARKs for Diverse Environments. Jonathan Bootle, Alessandro Chiesa, Yuncong Hu, and Michele Orrù. EUROCRYPT 2022.
  4. On Black-Box Constructions of Time and Space Efficient Sublinear Arguments from Symmetric-Key Primitives. Laasya Bangalore, Rishabh Bhadauria, Carmit Hazay, and Muthuramakrishnan Venkitasubramaniam. TCC 2022.
  5. Ligetron: Lightweight Scalable End-to-End Zero-Knowledge Proofs. Post-Quantum ZK-SNARKs on a Browser. Ruihan Wang, Carmit Hazay, and Muthuramakrishnan Venkitasubramaniam. 2024 IEEE Symposium on Security and Privacy (SP).

Technical plan

The plan to implement folding is simple, with a (very) sketchy overview provided below.

  1. Verifying Jolt proofs involves two procedures: verifying sum-check proofs, and folding polynomial evaluation claims for committed polynomials.
  2. Running Nova with BN254 as the primary curve, we can simply verify sum-check proofs natively.
  3. For folding polynomial evaluation claims, we will use the technique from HyperNova to continually fold linear claims about committed vectors. This requires one scalar multiplication in the verifier circuit per claim folded. We can offload that to Grumpkin via the technique from CycleFold, or do it non-natively if the rest of the circuit (which is devoted to verifying sum-check proofs) dominates the size of the recursive constraint system regardless.
  4. There are some nuances as to how to incorporate offline memory-checking techniques with folding, and in particular how to make sure that the state of the VM's memory at the end of shard $i$ matches the state of memory at the start of shard $i+1$. These are not terribly difficult to address, and the details will be fully fleshed out in an upcoming e-print.

Note that this plan does not require "non-uniform folding". The fact that there are many different primitive RISC-V instructions is handled by "monolithic Jolt". Folding is merely applied to accumulate many copies of the same claim, namely that a Jolt proof (minus any HyperKZG evaluation proof) was correctly verified.

Space cost estimates

If we shard the VM execution into fixed-size chunks of RISC-V cycles, then naively implemented we expect Jolt proving to require about 10 GBs of space. The bottleneck here is Spartan proving. The prover space cost for proving all the grand products in Jolt is less than that of Spartan proving, and the prover can compute the Spartan proof and more or less "wipe" memory before computing the grand product proofs.

Details of the 10 GB estimate

What dominates the space cost of the Spartan prover? Simply storing the three vectors $Az$, $Bz$, and $Cz$, where $A$, $B$, and $C$ are the R1CS constraint matrices and $z$ is the witness vector. In Jolt's R1CS instance there is a small, fixed number of constraints per cycle, and there is one entry of each of $Az$, $Bz$, and $Cz$ per constraint.

Naively implemented, the Spartan prover stores 32 bytes (i.e., 256 bits) per entry of $Az$, $Bz$, and $Cz$. This is actually overkill, as the entries are all small values, but it's simplest to ignore this and treat them as arbitrary field elements.

Hence, the number of bytes required to store $Az$, $Bz$, and $Cz$ is $3 \cdot 32 = 96$ bytes per constraint which, summed over all constraints in the shard, comes to about 10 billion bytes, aka 10 GBs.

The prover also has to store the witness vector $z$ itself, which has a similar number of entries as $Az$, $Bz$, and $Cz$, but $z$'s entries are all 32-bit data types and it's pretty easy to actually store them as 32 bits each throughout the entire Spartan protocol. So $z$ should only contribute a few hundred MBs to prover space at the shard sizes considered here.

The extra space overhead of taking the Jolt verifier (minus polynomial evaluation proofs, which don't need to be provided or verified when using folding to accumulate such claims) and turning it into constraints, as required to apply a recursive folding scheme, will be small by comparison (well under a GB).

Anticipated proving speed

Even with naive approaches to folding (e.g., using Nova rather than HyperNova, and without fully optimizing the Jolt proof size, meaning not batching all the grand product proofs to the maximum extent possible, so that Jolt proofs are closer to 200 KBs than 100 KBs), we expect the prover time of "Jolt with folding" to be close to the prover time cost of "Jolt applied monolithically" to a single shard.

In more detail, the extra time spent by the Jolt-with-folding prover on recursively proving verification of Jolt proofs will be modest. With folding, the "extra work" the prover does per shard to fold/recurse is simply to commit to roughly one million extra field elements. But the Jolt prover has to commit to about 80 million (small) field elements anyway just to prove the shard itself. So committing to the one million field elements doesn't have a huge impact on total prover time.

It is true that the 1 million extra field elements are random, and so roughly equivalent in terms of commitment time to 10 million "small" field elements. This still represents at most a ~13% increase in commitment costs for the Jolt prover.

In fact, for large enough shard sizes, the Jolt prover may be faster with folding than without it. This is because, with folding, the prover does not have to compute any HyperKZG evaluation proofs, except for a single HyperKZG evaluation proof computed for the final folded instance. The savings from skipping HyperKZG evaluation proofs may more than offset the extra work of committing to roughly one million extra field elements per shard.

Going below 10 GBs via smaller chunks

Projects focused on client-side proving may want space costs of 2 GBs (or even lower, as browsers often limit a single tab to 1 GB of space, especially on phones). One way to achieve that space bound is to use smaller chunks, slightly smaller than the chunk sizes used by STARK-based zkVMs like SP1 and RISC Zero. With chunks of this size, the Jolt-with-folding prover (again, naively implemented) will be somewhat slower per cycle than "monolithic Jolt", perhaps by a factor of up to two or three.

This time overhead can be reduced with additional engineering/optimizations. For example:

  • Maximally batching grand product proofs should reduce Jolt proof size (and hence cut the time spent on recursive proving) by a factor of roughly two.
  • For at least the first round of Spartan's sum-check, we could store far fewer bits per entry of $Az$, $Bz$, and $Cz$ rather than the full 256, which could save another factor of two in space cost. Also, the prover speed optimizations for applying sum-check to small values described in Bagad-Domb-Thaler can lead to additional space improvements for Spartan. At some point, the space cost of computing grand product proofs would dominate the space cost of Spartan proving and it won't be worthwhile to cut Spartan proving space further.
  • We can lower the size of Jolt proofs, and hence the time spent on recursive proving, by a factor of 2x-4x by switching to the grand product argument from Quarks/Spartan. For details, see https://jolt.a16zcrypto.com/future/proof-size-breakdown.html.

    This would reduce the amount of extra field elements committed for "folding/recursion" from about 1 million down to a few hundred thousand or even lower. It would increase prover time "outside of recursion" since the Quarks/Spartan prover commits to random field elements, but for small shard sizes (i.e., very small prover space, around 1GB or so) it may lead to a faster prover overall.

If all of the above optimizations are implemented, Jolt-with-folding prover space could be as low as 1 GB-2 GB with the prover only 30%-60% slower than "monolithic Jolt" (i.e., Jolt with no recursion or space control for the prover at all).

Core contributors

  • Michael Zhu
  • Sam Ragsdale
  • Arasu Arun
  • Noah Citron
  • Justin Thaler
  • Srinath Setty

Formal verification efforts

  • Carl Kwan
  • Quang Dao