patternjavaModerate

Java and C++ stack implementations with dramatically different performance

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

codereview stack java c++stackoverflow performance comparative-review

implementationsstackwithdramaticallyjavadifferentperformanceand

Problem

I'm developing a small programming language for use in any project I have where I feel a small scripting language could be used well. I've written two emulators for the language, one in C++ and one in Java. The C++ performs faster except for any recursion in the language, in which it suddenly performs terribly!

Here is the code from my language that I am running which runs in 1.5s on the Java emulator and 4s on the C++ emulator:

dec i = 0;
recurse();
print i;

def recurse:
    if i < 10000000:
        i++;
        recurse();
    return;

This compiles down into the following instructions:

% i = 0; #define the alias i as being variable 0;
# i needs to be initialised;
STOREINT i 0;
CALL @recurse; #call the label recurse, push this line onto the stack;
PRINTINTLN $i;
END; #end of program;

@recurse; #define the function recurse;
# if i is < 10000000 then inc i;
GEQ $i 10000000;
JCMP @return; #jump to the return statement
INC i;
CALL @recurse;
@return; 
RETURN; #return back to the line on the top of the stack;

The C++ implementation differs slightly from the Java implementation in that it uses an array of unions to store each "typeless" object, whereas Java just uses Object. Other than that, the code is almost identical (with C++ not needing to do any casting because from the instructions we can be sure which field in the union we are using, and therefore we access the correct field instead of casting the Object to the correct type in Java. C++ also doesn't need to use & 0xFF in order to fix sign problems with the java byte type, as we can make unsigned chars). The main difference between them seems to be the performance of the call stacks, which are implemented in each case as follows:

Java

```
public class ArrayStack implements Stack
{
private static final long serialVersionUID = 1L;
protected Object[] stack;
protected int capacity;
protected int pointer;

public ArrayStack(int capacity)
{
stack = new Obje

Solution

Performance

I'll get to a bit of a review, but first, since you're most interested in performance, I've done some benchmarking. In short, all evidence points towards maaartinus' theory being the cause.

Benchmark.java:

public class Benchmark {

    private static class Stored {
        private final int[] big = new int[250];

        public Stored() { }
    }

    public static void main(String[] args) {

        // Create the objects up front so we don't include allocation/construction time in the benchmark
        Stored stored[] = new Stored[1000000];
        for (int i = 0; i  stack = new ArrayStack(1000000);
            for (int i = 0; i < 1000000; ++i) { stack.push(stored[i]); }
            for (int i = 0; i < 1000000; ++i) { stack.pop(); }

            long end = System.nanoTime();

            if (!stack.isEmpty()) { System.out.println("This is just here to make sure nothing gets optimized out."); }

            System.out.printf("Time: %f seconds%n", (end - start) / 1e9);
        }   
    }
}

Once the JVM has warmed up and some JITing has (presumably) happened, each test takes 0.0038 seconds (with a very small variance). If the size of big in Stored is changed, this doesn't change timings since the stack stuff is still just copying a reference. Even if more references are added to Stored, this increasing the size of stored, there is still no time increase. This all makes sense: all Java has to do is copy a pointer, and the size of this pointer is fixed.

benchmark.cpp

int main() {

    // To be consistent with the Java approach, we'll create the objects up front. Noteably, this doesn't actually
    // matter since the copying of the objects is just as expensive as creation.
    // (Note: typically it wouldn't make sense for this to be dynamically allocated, but 1000000 objects was too big for
    // the stack.)
    Stored* stored = new Stored[1000000]();

    // Do the simple approach
    for (int n = 0; n  stack(1000000);
        for (int i = 0; i  diff = end - start;
        std::cout << "Time: " << diff.count() << " seconds\n";
    }
}

This was run with a few different Storeds. The first was a relatively large class:

struct Stored {
    int x[250];
};

The second was a very light class:

struct Stored {
    int x[1];
};

As a third option, the large class was kept, but rather than storing a value, a pointer was stored in the stack:

for (int i = 0; i < 1000000; ++i) { stack.push(&stored[i]); }

The timings were 0.30 seconds, 0.001 seconds, and 0.001 seconds for options 1, 2, and 3, repsectively. This is consistent with the Java theory and findings: C++ is peforming expensive copies that Java is not doing.

Benchmark Conclusions

The question now obviously becomes how to optimize your C++, and unfortunatley there's no simple solution to that without knowing a lot more about what exactly it is that you're putting into your stack. Move semantics are one option, provided your copying is expensive and you'd be happy to sacrifice the object into the stack rather than copy it.

The other option is to move away from value semantics so that copies become cheaper. If you have simple hierarchical ownership semantics (e.g. the objects being stored in the stack will always outlive the stack), this is super easy: just store pointers in the stack instead of objects. If your ownership isn't quite so simple, you would need to consider using something like heap allocation and a smart pointer. Both of these approaches are trade offs though. Smart pointers are safe but relatively expensive to copy, and raw pointers can quickly become a confusing mess if you're not very particular and consistent with your semantics. In both cases you no longer have the pleasantness of value semantics.

Code Snippets

public class Benchmark {

    private static class Stored {
        private final int[] big = new int[250];

        public Stored() { }
    }

    public static void main(String[] args) {

        // Create the objects up front so we don't include allocation/construction time in the benchmark
        Stored stored[] = new Stored[1000000];
        for (int i = 0; i < stored.length; ++i) { stored[i] = new Stored(); }

        for (int n = 0; n < 100; ++n) {
            long start = System.nanoTime();

            ArrayStack<Stored> stack = new ArrayStack<Stored>(1000000);
            for (int i = 0; i < 1000000; ++i) { stack.push(stored[i]); }
            for (int i = 0; i < 1000000; ++i) { stack.pop(); }

            long end = System.nanoTime();

            if (!stack.isEmpty()) { System.out.println("This is just here to make sure nothing gets optimized out."); }

            System.out.printf("Time: %f seconds%n", (end - start) / 1e9);
        }   
    }
}

int main() {

    // To be consistent with the Java approach, we'll create the objects up front. Noteably, this doesn't actually
    // matter since the copying of the objects is just as expensive as creation.
    // (Note: typically it wouldn't make sense for this to be dynamically allocated, but 1000000 objects was too big for
    // the stack.)
    Stored* stored = new Stored[1000000]();

    // Do the simple approach
    for (int n = 0; n < 100; ++n) {
        auto start = std::chrono::high_resolution_clock::now();

        ArrayStack<Stored> stack(1000000);
        for (int i = 0; i < 1000000; ++i) { stack.push(stored[i]); }
        for (int i = 0; i < 1000000; ++i) { stack.pop(); }

        if (!stack.isEmpty()) { std::cout << "This is here so the code doesn't get optimized out.\n"; }

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end - start;
        std::cout << "Time: " << diff.count() << " seconds\n";
    }
}

struct Stored {
    int x[250];
};

struct Stored {
    int x[1];
};

for (int i = 0; i < 1000000; ++i) { stack.push(&stored[i]); }

Context

StackExchange Code Review Q#92833, answer score: 13

Revisions (0)

No revisions yet.