Differences

This shows you the differences between two versions of the page.

--- java:python_differences [2025/04/03 14:41] – created carl
+++ java:python_differences [2025/04/03 17:23] (current) – carl
@@ Line 48: / Line 48: @@
 </code>
+Java has two advantages here:
+  - Closures can be made serializable.
+  - Class files have a stable ABI.
+These two make it easier to run code out of process.  First, closures (such as ''Runnable'', ''Callable'', ''Supplier'', etc.) can be marked ''Serializable''.  This is a less than ideal way to represent work, but it does work.  Pipeline libraries modeled on Google's FlumeJava, such as Apache Beam and Apache Crunch, use this approach with their ''DoFn'' for off-instance parallelism.  ((There are serious consequences to using Java serialization, and Effective Java devotes an entire chapter to why you should avoid it.))
+Second, the executable code can be packaged (as a jar, class file, etc.) and sent to a remote process or machine to execute the serialized closure.  This too is somewhat error-prone (e.g. class loader leaks) but workable.
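For contrast, here is a minimal sketch of the Python side of this comparison. It is an illustration added here, not part of the original argument, and it assumes the standard ''pickle'' module is the serialization mechanism in question. ''pickle'' stores a function as a reference (module name plus function name) rather than as code, so an anonymous closure has nothing to refer to. The helper ''try_pickle'' is a hypothetical name used only for this example.

```python
import math
import pickle

# A function is pickled by reference: the payload records the module and
# function name, not the bytecode. The receiver must already have the code.
payload = pickle.dumps(math.sqrt)
restored = pickle.loads(payload)


def try_pickle(obj):
    """Return True if obj can be pickled, False if pickle rejects it."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Which exception is raised depends on the Python version and on
        # whether the function is defined at module level or nested.
        return False


# A lambda has no importable name, so there is nothing to reference.
lambda_ok = try_pickle(lambda: 42)
```

Because only a name travels over the wire, the receiving process has to import the same module itself, which is the packaging problem the second point describes.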
+==== Process Initialization ====
+Python can either fork or spawn a new process to handle the additional work.  Because [[https://github.com/thatch/fork-is-problematic|Fork Is Problematic]], spawn is usually a better choice.  When spawning a process, several things that have already been done in the current process need to be done again:
+    - Create all new page mappings in the OS
+    - Reload all shared objects (such as libc)
+    - Set up logging, metrics, tracing, and any other observability tooling
+    - Open any files, timers, sockets, etc.
+    - Set up signal handlers
+    - Respawn threads
+    - Run module initializers (like ''__init__.py'')
+This can be slower than necessary, because the child process cannot easily share things with the parent process.  Things like connection pools, singletons, and other shared in-memory objects need to be recreated.  SSL / TLS connections are heavyweight to initialize, and servers don't like it when you make too many connections.  ((You can imagine passing file descriptors into the child process via a domain socket to share them, but this is complex and uncommon.  Plus, all the supporting data structures around the FD would need to be passed too.))
+Logging becomes more complex, as the child process needs to place its logs somewhere (e.g. S3, a local log daemon, the systemd journal, etc.).  All of the initialization routines will log their startup info again, adding noise to the logs.  In order to correlate the spawned child with the parent, we need a way of grouping these logs together.  Finally, we need to keep track of start and stop events, as PIDs can be reused.  Most of this is a non-issue with threads.  Logging may be easier with a long-lived child server process, though it becomes harder to correlate the actions in the child server process with the main process.
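One way to group child logs with the parent, sketched with the standard ''logging'' module. The names ''CorrelationFilter'', ''make_child_logger'', ''run_with_lifecycle_events'', and the ''run_id'' field are assumptions made for this example. The idea is that the parent generates a unique run id and hands it to the child, so log lines can be grouped without relying on PIDs alone.

```python
import logging
import os
import sys


class CorrelationFilter(logging.Filter):
    """Stamp every record with the run id and the parent's PID."""

    def __init__(self, run_id, parent_pid):
        super().__init__()
        self.run_id = run_id
        self.parent_pid = parent_pid

    def filter(self, record):
        record.run_id = self.run_id
        record.parent_pid = self.parent_pid
        return True  # never drop a record, only annotate it


def make_child_logger(run_id, stream=None, parent_pid=None):
    """Build a logger whose lines carry run id, parent PID, and own PID."""
    if parent_pid is None:
        parent_pid = os.getppid()
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(logging.Formatter(
        "run=%(run_id)s parent=%(parent_pid)s pid=%(process)d %(message)s"))
    handler.addFilter(CorrelationFilter(run_id, parent_pid))
    logger = logging.getLogger(f"child.{run_id}")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep these lines out of the root logger
    return logger


def run_with_lifecycle_events(logger, fn):
    """Log explicit start and stop events around fn, since PIDs get reused."""
    logger.info("child start")
    try:
        return fn()
    finally:
        logger.info("child stop")
```

The parent could create the id with ''uuid.uuid4().hex'' and pass it to the child as an argument or environment variable. The start and stop lines bound the lifetime of a given PID, which is what makes a later reuse of that PID distinguishable.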
 [[:threads_vs_processes|Threads vs. Processes]] contains an overview of how threads and processes are treated differently.  Python can use threads for IO-bound work, but threads appear to be used less often than ''asyncio'' for this.