Java and Python Differences

Java and Python are both garbage-collected languages that run in a VM. Each of them has rich library support and a large user base. Both work well for letting multiple people on a team work on the same codebase.

However, there are some hard-to-reconcile differences between the two.

Parallelism Challenges

The de facto way to parallelize code in Python is via multiprocessing. This can be done either by invoking os.fork() and managing the child processes directly, or by using concurrent.futures.ProcessPoolExecutor.

In order to pass work between processes, the data MUST be serialized, typically using the pickle module. This has important consequences, as some language-level constructs are not usable. When a task needs to be split up (or split off), the program gathers the function's name and arguments, and spawns or forks other processes to handle the work.
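As a minimal sketch of what actually crosses the process boundary: pickling a task captures a reference to the callable plus its arguments, not the code itself (`len` here stands in for any module-level callable):

```python
import pickle

# A task is described as (callable, args). Pickle serializes the callable
# by reference (module + qualified name); its bytecode is never copied.
task = (len, ([1, 2, 3],))
payload = pickle.dumps(task)      # bytes that can cross a process boundary
fn, args = pickle.loads(payload)  # the receiving side rebuilds the call
print(fn(*args))                  # → 3
```

This by-reference scheme is exactly why the serialized callable must be importable by name on the other side.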

Functions can't be serialized

One problem with functions is that they cannot be serialized. Metadata about a function (its module and qualified name) can be pickled, but the function object itself cannot. This means closures don't work with multiprocessing:

import concurrent.futures


def do_work(items: list[int]) -> int:

  def _worker(chunk: list[int]) -> int:
    count = 0
    for item in chunk:
      if item == 0:
        count += 1
    return count

  n = len(items)
  count = 0
  # Hand each half of the list to a worker process.
  chunks = [items[:n // 2], items[n // 2:]]
  with concurrent.futures.ProcessPoolExecutor() as executor:
    count += sum(executor.map(_worker, chunks))
  return count


print(do_work([0, 1, 2, 3]))
 

Trying to run this, we get:

  File "/usr/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'do_work.<locals>._worker'

Java has two advantages here:

1. Closures can be made serializable.

2. Class files have a stable ABI.

These two make it easier to run code out of process. First, closures (such as Runnable, Callable, Supplier, etc.) can be marked `Serializable`. This is a less-than-ideal way to represent work, but it does work. Apache Beam uses this approach with its `DoFn` for off-instance parallelism. 1)

Second, the executable code can be packaged (as a jar, class file, etc.) and sent to a remote process or machine to execute the serialized closure. This too is somewhat error prone (e.g. class loader leaks), but workable.

Process Initialization

Python can either fork or spawn a new process to handle the additional work. Because Fork Is Problematic, spawn is usually the better choice. When spawning a process, several things that were already done in the parent process need to be done again:

  1. Create all-new page mappings in the OS
  2. Reload all shared objects (such as libc)
  3. Set up logging, metrics, tracing, and any other observability machinery
  4. Reopen any files, timers, sockets, etc.
  5. Set up signal handlers
  6. Respawn threads
  7. Run module initializers (like `__init__.py`)
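The last item can be observed directly: under the spawn start method, the child re-imports the main module, so module-level initialization runs again in every worker (the print below is a stand-in for real setup work):

```python
import multiprocessing as mp
import os

# Module-level "initialization": runs once in the parent, and again in
# each spawned child when it re-imports this module.
print(f"initializing in pid {os.getpid()}")


def square(x: int) -> int:
  return x * x


if __name__ == "__main__":
  ctx = mp.get_context("spawn")  # spawn: fresh interpreter per child
  with ctx.Pool(processes=2) as pool:
    print(pool.map(square, range(4)))  # → [0, 1, 4, 9]
```

Running this prints the "initializing" line once per process, which is exactly the repeated-setup cost described above.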

This can be slower than necessary, because the child process cannot (easily) share things with the parent process. Connection pools, singletons, and other shared in-memory objects need to be recreated. SSL/TLS connections are heavyweight to initialize, and servers don't like it when you open too many connections. 2)

Logging becomes more complex, as the child process needs to put its logs somewhere (e.g. S3, a local log daemon, the systemd journal, etc.). All of the initialization routines will log this info again, cluttering the logs. To correlate the spawned child with its parent, we need a way of grouping their logs together. Finally, we need to keep track of start and stop events, as PIDs can be reused. Most of this is a non-issue with threads. It may be easier with a long-lived child server process, though it then becomes harder to correlate the actions in the child server with the main process.
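One sketch of the grouping problem: a logging.Filter (the `PPIDFilter` name is made up here) can stamp each record with the parent's PID so an aggregator can group a child's logs with its parent's:

```python
import logging
import os


class PPIDFilter(logging.Filter):
  """Attach the parent PID to every record as a correlation key."""

  def filter(self, record: logging.LogRecord) -> bool:
    record.ppid = os.getppid()
    return True


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("pid=%(process)d ppid=%(ppid)d %(message)s"))
log = logging.getLogger("worker")
log.addHandler(handler)
log.addFilter(PPIDFilter())
log.propagate = False  # avoid double-printing via the root logger
log.warning("starting work")  # e.g. "pid=1234 ppid=1 starting work"
```

As the text notes, a parent PID alone is not a robust key, since PIDs can be reused; a unique run ID generated at startup and passed to the child would be sturdier.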

Threads vs. Processes contains an overview of how threads and processes are treated differently. Python can use threads for IO-bound work, though this appears to be less common than asyncio.
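By contrast, a ThreadPoolExecutor never pickles anything, so the closure from the earlier example works unchanged (a sketch, suitable for IO-bound work):

```python
import concurrent.futures


def count_zeros(items: list[int]) -> int:
  # Threads share the parent's memory: there is no serialization step,
  # so a local closure is a perfectly valid task.
  def _worker(chunk: list[int]) -> int:
    return sum(1 for item in chunk if item == 0)

  n = len(items)
  chunks = [items[:n // 2], items[n // 2:]]
  with concurrent.futures.ThreadPoolExecutor() as executor:
    return sum(executor.map(_worker, chunks))


print(count_zeros([0, 1, 2, 3]))  # → 1
```

The GIL means this buys no CPU parallelism, but it sidesteps every serialization and process-initialization issue discussed above.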

1)
There are serious consequences to using Java serialization; Effective Java devotes an entire chapter to why you should avoid it.
2)
You can imagine passing file descriptors into the child process via a Unix domain socket to share them, but this is complex and uncommon. Plus, all the supporting data structures around the FD would need to be passed too.