There are several issues going on here in parallel, as it were.

The first is that solving a problem in parallel always involves performing more actual work than doing it sequentially. Overhead is involved in splitting the work among several threads and joining or merging the results. Problems like converting short strings to lower-case are small enough that they are in danger of being swamped by the parallel splitting overhead.

The second issue is that benchmarking Java programs is very subtle, and it is very easy to get confusing results. Two common issues are JIT compilation and dead code elimination. Short benchmarks often finish before or during JIT compilation, so they're not measuring peak throughput, and indeed they might be measuring the JIT itself. When compilation occurs is somewhat non-deterministic, so it may cause results to vary wildly as well.

For small, synthetic benchmarks, the workload often computes results that are thrown away. JIT compilers are quite good at detecting this and eliminating code that doesn't produce results that are used anywhere. This probably isn't happening in this case, but if you tinker around with other synthetic workloads, it can certainly happen. Of course, if the JIT eliminates the benchmark workload, it renders the benchmark useless.
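To make the dead-code hazard concrete, here is a minimal hand-rolled sketch. The class and method names are illustrative and not taken from the question. The first method throws its results away, so an optimizing JIT is permitted to remove some or all of the work; the second folds every result into a value that is returned to the caller, which keeps the work observable. JMH achieves the same effect when a benchmark method returns its result.

```java
import java.util.Arrays;

public class DeadCodeSketch {
    // Hypothetical workload: the lower-cased strings are never used, so an
    // optimizing JIT is allowed to eliminate part or all of this loop.
    static void discarded(String[] input) {
        for (String s : input) {
            s.toLowerCase();
        }
    }

    // Same workload, but every result feeds a checksum that the caller
    // receives, so the work cannot be treated as dead code.
    static long consumed(String[] input) {
        long checksum = 0;
        for (String s : input) {
            checksum += s.toLowerCase().hashCode();
        }
        return checksum;
    }

    public static void main(String[] args) {
        String[] array = new String[1000];
        Arrays.fill(array, "AbabagalamagA");
        discarded(array);
        System.out.println(consumed(array));
    }
}
```

Whether the JIT actually removes the first loop depends on the JVM and on how hot the code gets; the point is only that nothing prevents it.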

I strongly recommend using a well-developed benchmarking framework such as JMH instead of hand-rolling one of your own. JMH has facilities to help avoid common benchmarking pitfalls, including these, and it's pretty easy to set up and run. Here's your benchmark converted to use JMH:

package com.stackoverflow.questions;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class SO23170832 {
    @State(Scope.Benchmark)
    public static class BenchmarkState {
        static String[] array;
        static {
            array = new String[1000000];
            Arrays.fill(array, "AbabagalamagA");
        }
    }

    // Named @GenerateMicroBenchmark in pre-1.0 JMH; current JMH calls it @Benchmark.
    @Benchmark
    @OutputTimeUnit(TimeUnit.SECONDS)
    public List<String> sequential(BenchmarkState state) {
        return
            Arrays.stream(state.array)
                  .map(x -> x.toLowerCase())
                  .collect(Collectors.toList());
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.SECONDS)
    public List<String> parallel(BenchmarkState state) {
        return
            Arrays.stream(state.array)
                  .parallel()
                  .map(x -> x.toLowerCase())
                  .collect(Collectors.toList());
    }
}

I ran this using the command:

java -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

(The options indicate five warmup iterations, five benchmark iterations, and one forked JVM.) During its run, JMH emits lots of verbose messages, which I've elided. The summary results are as follows.

Benchmark                       Mode   Samples         Mean   Mean error    Units
c.s.q.SO23170832.parallel      thrpt         5        4.600        5.995    ops/s
c.s.q.SO23170832.sequential    thrpt         5        1.500        1.727    ops/s

Note that results are in ops per second, so it looks like the parallel run was about three times faster than the sequential run. But my machine has only two cores. Hmmm. And the mean error is actually larger than the mean itself! WAT? Something fishy is going on here.

This brings us to a third issue. Looking more closely at the workload, we can see that it allocates a new String object for each input, and it also collects the results into a list, which involves lots of reallocation and copying. I'd guess that this will result in a fair amount of garbage collection. We can see this by rerunning the benchmark with GC messages enabled:

java -verbose:gc -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

This gives results like:

[GC (Allocation Failure)  512K->432K(130560K), 0.0024130 secs]
[GC (Allocation Failure)  944K->520K(131072K), 0.0015740 secs]
[GC (Allocation Failure)  1544K->777K(131072K), 0.0032490 secs]
[GC (Allocation Failure)  1801K->1027K(132096K), 0.0023940 secs]
# Run progress: 0.00% complete, ETA 00:00:20
# VM invoker: /Users/src/jdk/jdk8-b132.jdk/Contents/Home/jre/bin/java
# VM options: -verbose:gc
# Fork: 1 of 1
[GC (Allocation Failure)  512K->424K(130560K), 0.0015460 secs]
[GC (Allocation Failure)  933K->552K(131072K), 0.0014050 secs]
[GC (Allocation Failure)  1576K->850K(131072K), 0.0023050 secs]
[GC (Allocation Failure)  3075K->1561K(132096K), 0.0045140 secs]
[GC (Allocation Failure)  1874K->1059K(132096K), 0.0062330 secs]
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.stackoverflow.questions.SO23170832.parallel
# Warmup Iteration   1: [GC (Allocation Failure)  7014K->5445K(132096K), 0.0184680 secs]
[GC (Allocation Failure)  7493K->6346K(135168K), 0.0068380 secs]
[GC (Allocation Failure)  10442K->8663K(135168K), 0.0155600 secs]
[GC (Allocation Failure)  12759K->11051K(139776K), 0.0148190 secs]
[GC (Allocation Failure)  18219K->15067K(140800K), 0.0241780 secs]
[GC (Allocation Failure)  22167K->19214K(145920K), 0.0208510 secs]
[GC (Allocation Failure)  29454K->25065K(147456K), 0.0333080 secs]
[GC (Allocation Failure)  35305K->30729K(153600K), 0.0376610 secs]
[GC (Allocation Failure)  46089K->39406K(154624K), 0.0406060 secs]
[GC (Allocation Failure)  54766K->48299K(164352K), 0.0550140 secs]
[GC (Allocation Failure)  71851K->62725K(165376K), 0.0612780 secs]
[GC (Allocation Failure)  86277K->74864K(184320K), 0.0649210 secs]
[GC (Allocation Failure)  111216K->94203K(185856K), 0.0875710 secs]
[GC (Allocation Failure)  130555K->114932K(199680K), 0.1030540 secs]
[GC (Allocation Failure)  162548K->141952K(203264K), 0.1315720 secs]
[Full GC (Ergonomics)  141952K->59696K(159232K), 0.5150890 secs]
[GC (Allocation Failure)  105613K->85547K(184832K), 0.0738530 secs]
1.183 ops/s

Note: the lines beginning with # are normal JMH output lines. All the rest are GC messages. This is just the first of the five warmup iterations, which precedes five benchmark iterations. The GC messages continued in the same vein during the rest of the iterations. I think it's safe to say that the measured performance is dominated by GC overhead and that the results reported should not be believed.

At this point it's unclear what to do. This is purely a synthetic workload. It clearly involves very little CPU time doing actual work compared to allocation and copying. It's hard to say what you really are trying to measure here. One approach would be to come up with a different workload that is in some sense more "real." Another approach would be to change the heap and GC parameters to avoid GC during the benchmark run.
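As a rough cross-check that does not require reading GC logs, the number of collections that happen during one run of the workload can be read from the standard `GarbageCollectorMXBean` API. The sketch below is an assumption about how one might do this, not part of the original benchmark; `GcCountSketch`, `totalGcCount`, and `workload` are illustrative names.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class GcCountSketch {
    // Sum of the collection counts reported by all collectors in this JVM.
    // A bean may report -1 when the count is undefined; those are skipped.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // The same allocation-heavy workload as the sequential benchmark.
    static List<String> workload(String[] array) {
        return Arrays.stream(array)
                     .map(String::toLowerCase)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] array = new String[1_000_000];
        Arrays.fill(array, "AbabagalamagA");
        long before = totalGcCount();
        List<String> result = workload(array);
        long after = totalGcCount();
        System.out.println(result.size() + " elements, "
                + (after - before) + " collections during the run");
    }
}
```

If the count is non-zero, one option is to rerun with a larger fixed heap (for example `-Xms4g -Xmx4g`; the sizes are illustrative) and see whether the collections disappear.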

Answer from Stuart Marks on Stack Overflow
Top answer
1 of 4
178

2 of 4
19

When benchmarking, you should pay attention to JIT compilation: timing behavior can change depending on how many code paths have been JIT-compiled. If I add a warm-up phase to your test program, the parallel version is a bit faster than the sequential version. Here are the results:

Warmup...
Benchmark...
Run 0:  sequential 0.12s  -  parallel 0.11s
Run 1:  sequential 0.13s  -  parallel 0.08s
Run 2:  sequential 0.15s  -  parallel 0.08s
Run 3:  sequential 0.12s  -  parallel 0.11s
Run 4:  sequential 0.13s  -  parallel 0.08s

The following code fragment contains the complete source code that I have used for this test.

import java.util.Arrays;
import java.util.stream.Collectors;

// The class name is illustrative; the original fragment omitted the enclosing class and imports.
public class WarmupBenchmark {
    public static void main(String... args) {
        String[] array = new String[1000000];
        Arrays.fill(array, "AbabagalamagA");
        // Warm-up: run both variants repeatedly so the JIT compiles the hot paths before timing.
        System.out.println("Warmup...");
        for (int i = 0; i < 100; ++i) {
            sequential(array);
            parallel(array);
        }
        System.out.println("Benchmark...");
        for (int i = 0; i < 5; ++i) {
            System.out.printf("Run %d:  sequential %s  -  parallel %s\n",
                i,
                test(() -> sequential(array)),
                test(() -> parallel(array)));
        }
    }
    private static void sequential(String[] array) {
        Arrays.stream(array).map(String::toLowerCase).collect(Collectors.toList());
    }
    private static void parallel(String[] array) {
        Arrays.stream(array).parallel().map(String::toLowerCase).collect(Collectors.toList());
    }
    // Wall-clock time of a single run, formatted in seconds.
    private static String test(Runnable runnable) {
        long start = System.currentTimeMillis();
        runnable.run();
        long elapsed = System.currentTimeMillis() - start;
        return String.format("%4.2fs", elapsed / 1000.0);
    }
}
Top answer
1 of 6
910

A parallel stream has a much higher overhead compared to a sequential one. Coordinating the threads takes a significant amount of time. I would use sequential streams by default and only consider parallel ones if

  • I have a massive amount of items to process (or the processing of each item takes time and is parallelizable)

  • I have a performance problem in the first place

  • I don't already run the process in a multithreaded environment (for example: in a web container, if I already have many requests to process in parallel, adding an additional layer of parallelism inside each request could have more negative than positive effects)

In your example, the performance will in any case be dominated by the synchronized access to System.out.println(), and making this process parallel will have no effect, or even a negative one.

Moreover, remember that parallel streams don't magically solve all the synchronization problems. If a shared resource is used by the predicates and functions used in the process, you'll have to make sure that everything is threadsafe. In particular, side effects are things you really have to worry about if you go parallel.
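A small sketch of the side-effect problem, using illustrative names that are not from the question. The first method mutates a shared `ArrayList` from a parallel `forEach`; `ArrayList` is not thread-safe, so it can lose elements or throw, and it is shown only for contrast. The second method lets the stream build the result through `collect`, which needs no shared mutable state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SideEffectSketch {
    // UNSAFE in parallel: several threads call add() on the same ArrayList.
    // Kept for contrast only; main() does not call it.
    static List<Integer> unsafeSquares(int n) {
        List<Integer> out = new ArrayList<>();
        IntStream.range(0, n).parallel().forEach(i -> out.add(i * i));
        return out;
    }

    // Safe: no shared state; the collector merges per-thread partial lists.
    static List<Integer> safeSquares(int n) {
        return IntStream.range(0, n)
                        .parallel()
                        .map(i -> i * i)
                        .boxed()
                        .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> squares = safeSquares(1000);
        System.out.println(squares.size() + " squares, last = " + squares.get(999));
    }
}
```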

In any case, measure, don't guess! Only a measurement will tell you if the parallelism is worth it or not.

2 of 6
316

The Stream API was designed to make it easy to write computations in a way that was abstracted away from how they would be executed, making switching between sequential and parallel easy.

However, just because it’s easy, doesn't mean it’s always a good idea, and in fact, it is a bad idea to just drop .parallel() all over the place simply because you can.

First, note that parallelism offers no benefits other than the possibility of faster execution when more cores are available. A parallel execution will always involve more work than a sequential one, because in addition to solving the problem, it also has to perform dispatching and coordinating of sub-tasks. The hope is that you'll be able to get to the answer faster by breaking up the work across multiple processors; whether this actually happens depends on a lot of things, including the size of your data set, how much computation you are doing on each element, the nature of the computation (specifically, does the processing of one element interact with processing of others?), the number of processors available, and the number of other tasks competing for those processors.

Further, note that parallelism also often exposes nondeterminism in the computation that is often hidden by sequential implementations; sometimes this doesn't matter, or can be mitigated by constraining the operations involved (i.e., reduction operators must be stateless and associative.)
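The associativity requirement can be sketched as follows (class and method names are illustrative). Addition is associative, so the sequential and parallel reductions agree. Subtraction is not, so the parallel result depends on how the range happens to be split, and it is deliberately not given an expected value here.

```java
import java.util.stream.IntStream;

public class AssociativitySketch {
    // Associative operator: the result is the same however the work is split.
    static int sum(boolean parallel) {
        IntStream s = IntStream.rangeClosed(1, 100);
        if (parallel) {
            s = s.parallel();
        }
        return s.reduce(0, Integer::sum);
    }

    // Non-associative operator: well-defined sequentially (a left fold),
    // but the parallel result depends on how the range is partitioned.
    static int difference(boolean parallel) {
        IntStream s = IntStream.rangeClosed(1, 100);
        if (parallel) {
            s = s.parallel();
        }
        return s.reduce(0, (a, b) -> a - b);
    }

    public static void main(String[] args) {
        System.out.println("sum sequential:        " + sum(false));
        System.out.println("sum parallel:          " + sum(true));
        System.out.println("difference sequential: " + difference(false));
        System.out.println("difference parallel:   " + difference(true)); // may differ
    }
}
```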

In reality, sometimes parallelism will speed up your computation, sometimes it will not, and sometimes it will even slow it down. It is best to develop first using sequential execution and then apply parallelism where

(A) you know that there's actually benefit to increased performance and

(B) that it will actually deliver increased performance.

(A) is a business problem, not a technical one. If you are a performance expert, you'll usually be able to look at the code and determine (B), but the smart path is to measure. (And, don't even bother until you're convinced of (A); if the code is fast enough, better to apply your brain cycles elsewhere.)

The simplest performance model for parallelism is the "NQ" model, where N is the number of elements, and Q is the computation per element. In general, you need the product NQ to exceed some threshold before you start getting a performance benefit. For a low-Q problem like "add up numbers from 1 to N", you will generally see a breakeven between N=1000 and N=10000. With higher-Q problems, you'll see breakevens at lower thresholds.
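A rough way to look for that breakeven on your own machine is to time the low-Q "add up numbers from 1 to N" problem at several values of N. The sketch below uses illustrative names and naive wall-clock timing, so it suffers from the usual warm-up and JIT pitfalls; treat the numbers as a hint, and use a proper harness before drawing conclusions.

```java
import java.util.function.LongSupplier;
import java.util.stream.LongStream;

public class NqSketch {
    static long sumSequential(long n) {
        return LongStream.rangeClosed(1, n).sum();
    }

    static long sumParallel(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    // Runs the task once and prints its result and elapsed time in microseconds.
    static void report(String label, long n, LongSupplier task) {
        long start = System.nanoTime();
        long result = task.getAsLong();
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.printf("N=%d %s: sum=%d in %d us%n", n, label, result, micros);
    }

    public static void main(String[] args) {
        for (long n : new long[] {100, 10_000, 1_000_000}) {
            report("sequential", n, () -> sumSequential(n));
            report("parallel", n, () -> sumParallel(n));
        }
    }
}
```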

But the reality is quite complicated. So until you achieve experthood, first identify when sequential processing is actually costing you something, and then measure if parallelism will help.

Top answer
1 of 2
61

Consider the following program:

import java.util.ArrayList;
import java.util.List;

public class Foo {
    public static void main(String... args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            list.add(i);
        }
        list.stream().forEach(System.out::println);
    }
}

You will notice that this program will output the numbers from 0 to 999 sequentially, in the order in which they are in the list. If we change stream() to parallelStream() this is not the case anymore (at least on my computer): all numbers are written, but in a different order. So, apparently, parallelStream() indeed uses multiple threads.

The htop observation is explained by the fact that most modern operating systems schedule even single-threaded applications across multiple cores (parts of the same thread may run on several cores, but of course not at the same time). So if you see that a process used more than one core, this does not necessarily mean that the program uses multiple threads.

Also, performance may not improve when using multiple threads. The cost of synchronization may cancel out the gains of using multiple threads. For simple testing scenarios this is often the case. For example, in the program above, System.out is synchronized, so effectively only one number can be written at a time, even though multiple threads are used.
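If the output order matters, a parallel stream can still respect the encounter order of its source. The sketch below (illustrative names) contrasts `forEach`, whose order on a parallel stream is unspecified, with `forEachOrdered` and with collecting into a list, both of which preserve encounter order for an ordered source.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class OrderingSketch {
    // Collecting an ordered parallel stream keeps the encounter order.
    static List<Integer> collectedInParallel(int n) {
        return IntStream.range(0, n)
                        .parallel()
                        .boxed()
                        .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("forEach (order may vary between runs):");
        IntStream.range(0, 10).parallel().forEach(System.out::println);

        System.out.println("forEachOrdered (always 0 to 9 in order):");
        IntStream.range(0, 10).parallel().forEachOrdered(System.out::println);

        System.out.println("collected: " + collectedInParallel(10));
    }
}
```

Preserving order can cost some of the parallel speedup, so it is worth asking for only when the order is actually needed.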

2 of 2
3

Adding to @Hoopje's answer:

Before using parallelStream(), read this:

  1. It is multi-threaded. Just writing parallelStream() to get parallelism is almost always a bad idea in Java. There are some cases where it will work, but not always. There are other ways to achieve parallelism, and you almost always need to think carefully before adopting a multi-threaded solution.
  2. It uses the default JVM thread pool (the common ForkJoinPool). So, if you are doing any blocking operation such as a network call, the entire Java application can get stuck. That's the biggest problem. There are other problems with task allocation as well. A simple ExecutorService with n threads provides better performance than parallel streams.
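As a sketch of the ExecutorService alternative mentioned in point 2, the example below runs blocking tasks on a dedicated fixed-size pool, so they do not occupy the shared pool that parallel streams use. All names are illustrative, and `blockingCall` is a hypothetical stand-in for something like a network request. It makes no claim about relative performance, which would have to be measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DedicatedPoolSketch {
    // Hypothetical stand-in for a blocking operation such as a network call.
    static String blockingCall(int id) throws InterruptedException {
        Thread.sleep(10);
        return "result-" + id;
    }

    // Runs `tasks` blocking calls on a private pool of `threads` threads and
    // returns the results in submission order.
    static List<String> runAll(int tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> jobs = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                jobs.add(() -> blockingCall(id));
            }
            List<String> results = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted jobs.
            for (Future<String> future : pool.invokeAll(jobs)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(8, 4));
    }
}
```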

You can also read: Java Parallel Streams Are Bad for Your Health! | JRebel by Perforce

r/java on Reddit: A surprising pain point regarding Parallel Java Streams (featuring mailing list discussion with Viktor Klang).
November 20, 2024

First off, apologies for being AWOL. Been (and still am) juggling a lot of emergencies, both work and personal.

My team was in crunch time to respond to a pretty ridiculous client ask. In order to get things in in time, we had to ignore performance, and kind of just took the "shoot first, look later" approach. We got surprisingly lucky, except in one instance where we were using Java Streams.

It was a seemingly simple task -- download a file, split into several files based on an attribute, and then upload those split files to a new location.

But there is one catch -- both the input and output files were larger than the amount of RAM and hard disk available on the machine. Or at least, I was told to operate on that assumption when developing a solution.

No problem, I thought. We can just grab the file in batches and write out the batches.

This worked out great, but the performance was not good enough for what we were doing. In my overworked and rushed mind, I thought it would be a good idea to just turn on parallelism for that stream. That way, we could run N times faster, according to the number of cores on that machine, right?

Before I go any further, this is (more or less) what the stream looked like.

try (final Stream<String> myStream = SomeClass.openStream(someLocation)) {
    myStream
        .parallel()
        //insert some intermediate operations here
        .gather(Gatherers.windowFixed(SOME_BATCH_SIZE))
        //insert some more intermediate operations here
        .forEach(SomeClass::upload)
        ;
}

So, running this sequentially, it worked just fine on both smaller and larger files, albeit, slower than we needed.

So I turned on parallelism, ran it on a smaller file, and the performance was excellent. Exactly what we wanted.

So then I tried running a larger file in parallel.

OutOfMemoryError

I thought, ok, maybe the batch size is too large. Dropped it down to 100k lines (which is tiny in our case).

OutOfMemoryError

Getting frustrated, I dropped my batch size down to 1 single, solitary line.

OutOfMemoryError

Losing my mind, I boiled down my stream to the absolute minimum possible functionality possible to eliminate any chance of outside interference. I ended up with the following stream.

final AtomicLong rowCounter = new AtomicLong();
myStream
    .parallel()
    //no need to batch because I am literally processing this file each line at a time, albeit, in parallel.
    .forEach(eachLine -> {
        final long rowCount = rowCounter.getAndIncrement();
        if (rowCount % 1_000_000 == 0) { //This will log the 0 value, so I know when it starts.
            System.out.println(rowCount);
        }
    })
    ;

And to be clear, I specifically designed that if statement so that the 0 value would be printed out. I tested it on a small file, and it did exactly that, printing out 0, 1000000, 2000000, etc.

And it worked just fine on both small and large files when running sequentially. And it worked just fine on a small file in parallel too.

Then I tried a larger file in parallel.

OutOfMemoryError

And it didn't even print out the 0. Which means, it didn't even process ANY of the elements AT ALL. It just fetched so much data and then died without hitting any of the pipeline stages.

At this point, I was furious and panicking, so I just turned my original stream sequential and upped my batch size to a much larger number (but still within our RAM requirements). This ended up speeding up performance pretty well for us because we made fewer (but larger) uploads. Which is not surprising -- each upload has to go through that whole connection process, and thus, we are paying a tax for each upload we do.

Still, this just barely met our performance needs, and my boss told me to ship it.

Weeks later, when things finally calmed down enough that I could breathe, I went onto the mailing list to figure out what on earth was happening with my stream.

Here is the start of the mailing list discussion.

https://mail.openjdk.org/pipermail/core-libs-dev/2024-November/134508.html

As it turns out, when a stream turns parallel, the intermediate and terminal operations you do on that stream will decide the fetching behaviour the stream uses on the source.

In our case, that meant that, if MY parallel stream used the forEach terminal operation, then the stream decides that the smartest thing to do to speed up performance is to fetch the entire dataset ahead of time and store it into an internal buffer in RAM before doing ANY PROCESSING WHATSOEVER. Resulting in an OutOfMemoryError.

And to be fair, that is not stupid at all. It makes good sense from a performance standpoint. But it makes things risky from a memory standpoint.

Anyways, this is a very sharp and painful corner of parallel streams that I did not know about, so I wanted to bring it up here in case it would be useful for folks. I also intend to make a StackOverflow post to explain this in better detail.

Finally, as a silver lining, Viktor Klang let me know that a .gather() immediately followed by a .collect() is immune to the pre-fetching behaviour mentioned above. Therefore, I could just create a custom Collector that does what I was doing in my forEach(). Doing it that way, I could run things in parallel safely without any fear of the dreaded OutOfMemoryError.
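
As a rough sketch of what such a custom Collector might look like, here is the row-counting logic from the forEach() above moved into a Collector. The class and method names (RowCountingCollector, countingRows) are made up for illustration, and the sketch covers only the Collector half. The .gather() step that would precede it is left out, so this is not the exact pipeline discussed on the mailing list.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collector;
import java.util.stream.Stream;

final class RowCountingCollector {

    // Hypothetical helper: does the per-row work that forEach() did,
    // and returns the number of rows this collector saw.
    static <T> Collector<T, long[], Long> countingRows(AtomicLong rowCounter) {
        return Collector.of(
            () -> new long[1],                        // one mutable counter per partial result
            (acc, line) -> {
                long rowCount = rowCounter.getAndIncrement();
                if (rowCount % 1_000_000 == 0) {      // same progress logging as the forEach() version
                    System.out.println(rowCount);
                }
                acc[0]++;
            },
            (left, right) -> {                        // merge partial results from parallel workers
                left[0] += right[0];
                return left;
            },
            acc -> acc[0]);
    }

    public static void main(String[] args) {
        AtomicLong rowCounter = new AtomicLong();
        long rows = Stream.of("a", "b", "c")
            .parallel()
            .collect(countingRows(rowCounter));
        System.out.println(rows + " rows");
    }
}
```

In a real pipeline, the accumulator lambda is where the upload (or whatever per-row work is needed) would go.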

(And tbh, forEach() wasn't really the best idea for that operation.) You can read more about it in the mailing list link above.

Please let me know if there are any questions, comments, or concerns.

EDIT -- Some minor clarifications. There are 2 issues interleaved here that make it difficult to track the error.

  1. Gatherers don't (currently) play well with some of the other terminal operations when running in parallel.

  2. Iterators are parallel-unfriendly when operating as a stream source.

When I tried to boil things down to the simplistic scenario in my code above, I was no longer afflicted by problem 1, but was now afflicted by problem 2. My stream source was the source of the problem in that completely boiled down scenario.
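
To make problem 2 a bit more concrete, here is a small sketch of how an Iterator-backed source behaves when it is asked to split, which is what a parallel stream does to hand out work. It assumes the OpenJDK behaviour as I understand it: the spliterator returned by Spliterators.spliteratorUnknownSize() splits by copying a batch of elements out of the iterator into an array. That is an implementation detail rather than something the Javadoc guarantees. The class name and the endless iterator are made up for illustration.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.atomic.AtomicLong;

final class IteratorSplitDemo {

    // Returns how many elements a single trySplit() call pulls out of an
    // iterator-backed source before any pipeline stage has run.
    static long elementsPulledByFirstSplit() {
        AtomicLong pulled = new AtomicLong();

        // Hypothetical stand-in for a lazy, effectively unbounded line source.
        Iterator<String> lines = new Iterator<String>() {
            @Override public boolean hasNext() { return true; }
            @Override public String next() { return "line " + pulled.getAndIncrement(); }
        };

        Spliterator<String> source =
            Spliterators.spliteratorUnknownSize(lines, Spliterator.ORDERED);

        // A parallel stream calls trySplit() to carve off work for other threads.
        Spliterator<String> prefix = source.trySplit();
        if (prefix == null) {
            return 0; // this source refused to split
        }
        return pulled.get();
    }

    public static void main(String[] args) {
        System.out.println("pulled before any processing: " + elementsPulledByFirstSplit());
    }
}
```

On the OpenJDK builds I am aware of, the batch size also grows with each further split. So an iterator-backed source can buffer a lot of data before any element reaches the rest of the pipeline.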

Now that said, that only makes this problem less likely to occur than it appears. The simple reality is, it worked when running sequentially, but failed when running in parallel. And the only way I could find out that my stream source was "bad" was by diving into all sorts of libraries that create my stream. It wasn't until then that I realized the danger I was in.

Top answer
1 of 5
40
I did want to follow up about one point Viktor made later on in the conversation.

https://mail.openjdk.org/pipermail/core-libs-dev/2024-November/134542.html

And here is the quote.

"In a potential future where all intermediate operations are Gatherer-based, and all terminal operations are Collector-based, it would just work as expected. But with that said, I'm not sure it is practically achievable because some operations might not have the same performance-characteristics as before."

Me personally, I would GLADLY accept a flag on stream (similar to parallel() or unordered()) that would allow me to guarantee that my stream never pre-fetches, even if I take a massive performance hit. If that can be accomplished by making all intermediate operations be implemented by a Gatherer under the hood, that is A-OK with me.

The reality is, not all streams are compute bound. Some are IO bound, but are otherwise, a great fit for streams. Having a method that allows us to optimize for that fact is a new type of performance enhancement that I would greatly appreciate, even if it degrades performance in other ways.
2 of 5
11
This was a fascinating read. Thank you for sharing. I guess it is kinda bad when higher-level, non-trivial APIs, like streams or fork-join, do not expose lower-level operations as user-overridable constructs. In this example, that would be an iteration strategy for streams, or the underlying executor of the fork-join pool. It seems like an obvious thing to have, because nobody knows better how things will be used than the end user.
Top answer
1 of 3
5

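You can define a custom thread pool by implementing the Executor interface so that it increases or decreases the number of threads in the pool as needed. You can also submit your parallelStream chain to a pool, as shown here using a ForkJoinPool. Note that this relies on the stream running in the pool that invoked it, which is an implementation detail rather than documented behaviour: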

I've created a working example which prints the threads that are doing the work:

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class TestParallel
{
  public static void main(String... args) throws InterruptedException, ExecutionException
  {
    testParallel();
  }
  
  
  static Long sum(long a, long b)
  {
    System.out.println(Thread.currentThread() + " - sum: " + a + " " + b);
    return a + b;
  }
  
  public static void testParallel() 
      throws InterruptedException, ExecutionException {
        
        long firstNum = 1;
        long lastNum = 10;

        List<Long> aList = LongStream.rangeClosed(firstNum, lastNum).boxed()
          .collect(Collectors.toList());

        System.out.println("custom: ");
        System.out.println();
        
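        ForkJoinPool customThreadPool = new ForkJoinPool(4);
        long totalCustom;
        try {
          totalCustom = customThreadPool.submit(
            () -> aList.parallelStream().reduce(0L, TestParallel::sum)).get();
        } finally {
          customThreadPool.shutdown(); // release the pool's worker threads when done
        }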
        
        System.out.println();
        System.out.println("standard: ");
        System.out.println();
        
        long totalStandard = aList.parallelStream().reduce(0L, TestParallel::sum);
        
        System.out.println();
        System.out.println(totalCustom + " " + totalStandard);
    }
}

Personally, if you want to get to that level of control, I'm not sure the streaming API is worth bothering with. It's not doing anything you can't do with Executors and concurrent libs. It's just a simplified facade to those features with limited capabilities.

Streams are kind of nice when you need to lay out a simple multi-step process in a little bit of code. But if all you are doing is using them to manage parallelism of tasks, the Executors and ExecutorService are more straightforward IMO. One thing I would avoid is pushing the number of threads above your machine's native thread count unless you have IO-bound processing. And if that's the case NIO is the more efficient solution.
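
For comparison, here is a minimal sketch of the same kind of summation done with a plain ExecutorService instead of a parallel stream. The class name, the chunking scheme, and the thread count are my own choices for illustration, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ExecutorSum {

    // Sums first..last (inclusive) by splitting the range into chunks
    // and summing each chunk on a fixed-size thread pool.
    static long sumRange(long first, long last, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long chunk = Math.max(1, (last - first + 1) / threads);
            List<Future<Long>> parts = new ArrayList<>();
            for (long lo = first; lo <= last; lo += chunk) {
                final long from = lo;
                final long to = Math.min(last, lo + chunk - 1);
                Callable<Long> task = () -> {
                    long s = 0;
                    for (long i = from; i <= to; i++) {
                        s += i;
                    }
                    return s;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumRange(1, 10, 4));
    }
}
```

Here the thread count and the way the work is split are both explicit, which is the level of control the stream API does not expose directly.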

What I'm not sure about is what the logic is that decides when to use multiple threads and when to use one. You'd have to better explain what factors come into play.

2 of 3
2

I don't know if this is useful, but there is a design pattern called Bridge that decouples an abstraction from its implementation so that you can switch between implementations at runtime.

A simple example would be a stack. For stacks where the total amount of data stored at one time is relatively small, it is more efficient to use an array. When the amount of data hits a certain point, it becomes better to use a linked-list. The stack implementation determines when it switches from one to the other.

For your case, it sounds like the processing would be behind some interface and based on the volume (do you know it before you start the processing?) your Processor class could use streams or parallel streams as appropriate.
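
A minimal sketch of that idea, assuming the volume is known up front as a List size. The class name and the threshold value are placeholders I made up. A sensible cut-over point would have to come from measuring your own workload.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class VolumeAwareProcessor {

    // Placeholder cut-over point; an assumption, not a measured value.
    static final int PARALLEL_THRESHOLD = 10_000;

    // Callers see one method; the choice of sequential vs parallel stays hidden behind it.
    static List<String> toLowerCase(List<String> input) {
        Stream<String> stream = input.size() >= PARALLEL_THRESHOLD
            ? input.parallelStream()
            : input.stream();
        return stream.map(String::toLowerCase).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(toLowerCase(Arrays.asList("AbabagalamagA", "HELLO")));
    }
}
```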
