add distributed simulations blog post

ec1f98ab · Marvin Meiers · ccafe469 · ec1f98ab
Verified Commit ec1f98ab authored 6 months ago by Marvin Meiers
--- a/_posts/2024-09-04-distributed-simulations.md
+++ b/_posts/2024-09-04-distributed-simulations.md
+---
+title: "Distributed Simulations Using SimBricks"
+subtitle: |
+  How does SimBricks help users to scale up their simulations
+  by distributing them across multiple machines?
+date: 2024-09-04
+author: marvin
+permalink: /blog/distributed-simulations.html
+card_image: TODO
+---
+
+SimBricks allows users to run full-system simulations by running multiple
+simulators as [separate loosely coupled processes](loosely-coupled-simulator-processes.html).
+Scaling up full-system simulations in SimBricks is usually as easy as adding
+more simulated components, where each new component is simulated by an
+additional simulator process. Additionally, SimBricks allows decomposition of
+some simulators into multiple instances running as different processes,
+preventing them from becoming a bottleneck while scaling up. We will further
+explore decomposition of simulators in a future blog post.
+
+This approach naturally parallelizes the simulation and keeps the simulation
+time low, but it also requires more resources, especially in the form of
+physical CPU cores. Since SimBricks adapters
+[poll shared memory queues](shm-message-passing.html), the simulator processes
+are always busy, which means that we should not oversubscribe the available CPU
+cores. Therefore, the size of a full-system simulation on one host is limited
+by its resources, and we need to distribute the processes across multiple
+machines to scale the simulation beyond the limits of a single host. In the
+following, we cover how SimBricks uses proxies that communicate over the
+network to distribute simulations across multiple machines.
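
To make the core budget concrete, here is a rough back-of-the-envelope sketch. It assumes one busy-polling process per physical core and, as a simplification, that every host in a distributed run reserves one core for a proxy. The function name and parameters are hypothetical and not part of SimBricks.

```python
import math


def hosts_needed(num_simulators: int, cores_per_host: int,
                 proxy_cores_per_host: int = 1) -> int:
    """Estimate the number of hosts needed without oversubscribing cores.

    Assumption: each simulator process occupies one core, and each host in
    a distributed run gives up `proxy_cores_per_host` cores to proxies.
    """
    if num_simulators <= cores_per_host:
        return 1  # everything fits on one host, so no proxies are needed
    usable = cores_per_host - proxy_cores_per_host
    if usable <= 0:
        raise ValueError("no cores left for simulators")
    return math.ceil(num_simulators / usable)


print(hosts_needed(40, 48))   # fits on a single 48-core host
print(hosts_needed(100, 48))  # needs several hosts, each with 47 usable cores
```

With more than two hosts, a host may need one proxy per peer machine, so the real budget can be tighter than this estimate.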
+
+# Scale Up By Using Separate Proxy Processes
+
+SimBricks uses message passing for communication between the different simulator
+processes. Communication between two simulator processes on the same host is
+implemented with shared memory queues. Scaling out simulations by partitioning
+components across multiple hosts can be accomplished by replacing the shared
+memory queues with network communication.
+
+However, directly implementing this in individual component simulators has two
+major drawbacks. First, it increases the complexity for
+[integration](integrating-simulators.html), as each simulator adapter needs to
+implement an additional message transport. Second, it increases communication
+overhead in component simulators, leaving fewer processor cycles for simulators
+and increasing simulation time. To avoid these drawbacks, we instead implement
+network communication separately in proxies.
+
+SimBricks proxies connect to local component simulators through shared memory
+queues, in the same way as two simulators would connect to each other. Each
+proxy forwards messages over the network to its peer proxy, which operates
+symmetrically. This requires an additional processor core for the proxy on each
+side, but is fully transparent to component simulators and does not increase
+their communication overhead, since the simulator adapters stay the same.
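
The forwarding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SimBricks proxy implementation: `queue.Queue` stands in for a shared memory queue, `socket.socketpair()` stands in for the TCP connection between two hosts, and the length-prefixed framing and all function names are made up for this example.

```python
import queue
import socket
import struct
import threading

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length


def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def send_side(local_q, sock):
    """Proxy on host A: drain the local queue and forward to the peer."""
    while True:
        msg = local_q.get()
        if msg is None:  # sentinel: signal shutdown with a zero-length frame
            sock.sendall(HEADER.pack(0))
            return
        sock.sendall(HEADER.pack(len(msg)) + msg)


def recv_side(sock, local_q):
    """Proxy on host B: receive frames and enqueue them locally."""
    while True:
        (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
        if length == 0:
            local_q.put(None)
            return
        local_q.put(recv_exact(sock, length))


def demo():
    a_sock, b_sock = socket.socketpair()
    to_remote, from_remote = queue.Queue(), queue.Queue()
    sender = threading.Thread(target=send_side, args=(to_remote, a_sock))
    receiver = threading.Thread(target=recv_side, args=(b_sock, from_remote))
    sender.start()
    receiver.start()
    for msg in (b"pkt-1", b"pkt-2"):
        to_remote.put(msg)  # the "simulator" only ever touches its queue
    to_remote.put(None)
    sender.join()
    receiver.join()
    received = []
    while (msg := from_remote.get()) is not None:
        received.append(msg)
    a_sock.close()
    b_sock.close()
    return received


print(demo())
```

The simulator side only interacts with its local queue, which is why the proxy stays transparent to it. In this sketch a zero-length frame is reserved as the shutdown marker, so empty messages cannot be forwarded.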
+
+At the moment, SimBricks provides two proxy implementations, one for each
+supported network protocol: TCP and RDMA. Additional proxies can be added to
+support further communication protocols.
+
+SimBricks proxies also implement multiplexing, so that multiple connections of
+component simulators between two machines can be handled by the same pair of
+proxies. This reduces the number of proxies needed and therefore allows more CPU
+cores to be used for simulators.
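
One common way to multiplex several logical connections over a single stream is to tag every message with a connection identifier. The sketch below shows that technique. The header layout (`uint16` connection id, `uint32` length) and the function names are assumptions for illustration and do not describe the SimBricks wire format.

```python
import struct
from collections import defaultdict

# Hypothetical frame header: connection id (uint16), payload length (uint32)
FRAME_HEADER = struct.Struct("!HI")


def mux(conn_id: int, payload: bytes) -> bytes:
    """Wrap a payload so the peer can tell which connection it belongs to."""
    return FRAME_HEADER.pack(conn_id, len(payload)) + payload


def demux(stream: bytes) -> dict:
    """Split a byte stream of frames back into per-connection message lists."""
    per_conn = defaultdict(list)
    offset = 0
    while offset < len(stream):
        conn_id, length = FRAME_HEADER.unpack_from(stream, offset)
        offset += FRAME_HEADER.size
        per_conn[conn_id].append(stream[offset:offset + length])
        offset += length
    return dict(per_conn)


# Two simulator connections share one proxy-to-proxy stream.
stream = mux(0, b"a") + mux(1, b"bb") + mux(0, b"c")
print(demux(stream))
```

Message order is preserved per connection, and a single proxy pair carries both connections.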
+
+# Orchestrating Proxies
+
+SimBricks' [orchestration framework](orchestration_framework.html) supports
+using proxies to distribute full-system simulations across multiple machines.
+The user creates a distributed experiment and adds simulation components just
+as in a normal non-distributed simulation. Then, the user adds proxies to the
+experiment as needed and finally assigns the simulation components to the
+different machines. When starting the simulation, the user provides a JSON file
+containing information about the available machines, such as the IP address and
+the working directory. The orchestration framework then takes care of running
+all simulators and proxies on the respective machines, using SSH to execute
+processes on remote machines.
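
To illustrate the idea, the following sketch turns a machine description and a component assignment into SSH invocations. The JSON field names (`ip`, `workdir`), the component names, and the `./run-component` command are hypothetical placeholders. They are not the format or the API of the SimBricks orchestration framework, and the commands are only built, never executed.

```python
import json
import shlex

# Hypothetical machine description; the real file format may differ.
HOSTS_JSON = """
[
  {"ip": "10.0.0.1", "workdir": "/tmp/simbricks-work"},
  {"ip": "10.0.0.2", "workdir": "/tmp/simbricks-work"}
]
"""


def ssh_command(host: dict, command: list) -> list:
    """Build an argv list that runs `command` in the host's working directory."""
    remote = f"cd {shlex.quote(host['workdir'])} && {shlex.join(command)}"
    return ["ssh", host["ip"], remote]


hosts = json.loads(HOSTS_JSON)

# Hypothetical assignment of components to indices in the host list.
assignment = {"server-sim": 0, "client-sim": 1, "proxy-a": 0, "proxy-b": 1}

for name, host_idx in assignment.items():
    argv = ssh_command(hosts[host_idx], ["./run-component", name])
    print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run` would start the process on the remote machine, provided key-based SSH access is set up.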
+
+The orchestration framework also includes an example for automatically
+distributing an experiment across two hosts, showing that this step can even be
+automated.
+
+If you have questions or would like to learn more: