Verified commit ec1f98ab authored by Marvin Meiers

add distributed simulations blog post

parent ccafe469
Pipeline #106319 passed
---
title: "Distributed Simulations Using SimBricks"
subtitle: |
How does SimBricks help users to scale up their simulations
by distributing them across multiple machines?
date: 2024-09-04
author: marvin
permalink: /blog/distributed-simulations.html
card_image: TODO
---
SimBricks allows users to run full-system simulations by running multiple
simulators as [separate loosely coupled processes](loosely-coupled-simulator-processes.html).
Scaling up full-system simulations in SimBricks is usually as easy as adding
more simulated components, where each new component is simulated by an
additional simulator process. Additionally, SimBricks allows decomposition of
some simulators into multiple instances running as different processes,
preventing them from becoming a bottleneck while scaling up. We will further
explore decomposition of simulators in a future blog post.
This approach naturally parallelizes the simulation and keeps simulation time
low, but it also requires more resources, especially physical CPU cores. Since
SimBricks adapters [poll shared memory queues](shm-message-passing.html), the
simulator processes are always busy, so we should not oversubscribe the
available CPU cores. The size of a full-system simulation on one host is
therefore limited by that host's resources, and scaling beyond a single host
requires distributing the processes across multiple machines. In the following,
we cover how SimBricks uses proxies that communicate over the network to
distribute simulations across multiple machines.
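
As a rough illustration of this core budget, the following sketch estimates how many hosts a simulation needs when every busy-polling process gets its own core. The function name and the assumption that each host reserves a fixed number of cores for proxies are hypothetical and only serve the example; they are not part of SimBricks.

```python
import math


def hosts_needed(num_simulators: int, cores_per_host: int,
                 proxy_cores_per_host: int = 1) -> int:
    """Estimate the number of hosts when each process needs its own core.

    Assumption (not from SimBricks): once the simulation is distributed,
    every host reserves `proxy_cores_per_host` cores for proxies.
    """
    if num_simulators <= cores_per_host:
        return 1  # everything fits on one host, so no proxies are needed
    usable = cores_per_host - proxy_cores_per_host
    if usable <= 0:
        raise ValueError("no cores left for simulators on each host")
    return math.ceil(num_simulators / usable)
```

For example, under these assumptions 40 simulator processes on 16-core hosts leave 15 usable cores per host, so three hosts are required.
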
# Scale Up By Using Separate Proxy Processes
SimBricks uses message passing for communication between the different simulator
processes. The communication between two simulator processes on the same host is
implemented by shared memory queues. In principle, scaling out a simulation by
partitioning its components across multiple hosts only requires replacing the
shared memory queues with network communication.
However, directly implementing this in individual component simulators has two
major drawbacks. First, it increases the complexity for
[integration](integrating-simulators.html), as each simulator adapter needs to
implement an additional message transport. Second, it increases communication
overhead in component simulators, leaving fewer processor cycles for simulators
and increasing simulation time. To avoid these drawbacks, we instead implement
network communication separately in proxies.
SimBricks proxies connect to local component simulators through shared memory
queues, in the same way as two simulators would connect to each other, and
forward messages over the network to their peer proxy, which operates
symmetrically. This requires an additional processor core for the proxy on each
side, but it is fully transparent to component simulators and does not increase
their communication overhead, since the simulator adapters stay the same.
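
The forwarding idea can be sketched in a few lines. This is a minimal illustration and not the SimBricks proxy implementation: a `queue.Queue` stands in for the shared memory queue, `socket.socketpair()` stands in for the TCP connection between two hosts, and the length-prefixed framing and all function names are assumptions made for the example.

```python
import queue
import socket
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer proxy closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def proxy_send(local_q: queue.Queue, sock: socket.socket) -> int:
    """Poll the local queue and forward every pending message to the peer."""
    sent = 0
    while True:
        try:
            msg = local_q.get_nowait()  # non-blocking poll of the local queue
        except queue.Empty:
            return sent
        sock.sendall(HEADER.pack(len(msg)) + msg)
        sent += 1


def proxy_recv(sock: socket.socket, local_q: queue.Queue, count: int) -> None:
    """Receive `count` messages from the peer and enqueue them locally."""
    for _ in range(count):
        (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
        local_q.put(recv_exact(sock, length))


def demo() -> list:
    """Send two messages from simulator A's queue to simulator B's queue."""
    a_out, b_in = queue.Queue(), queue.Queue()
    sock_a, sock_b = socket.socketpair()
    a_out.put(b"pcie-read")
    a_out.put(b"eth-frame")
    n = proxy_send(a_out, sock_a)
    proxy_recv(sock_b, b_in, n)
    sock_a.close()
    sock_b.close()
    return [b_in.get_nowait() for _ in range(n)]
```

The simulators on either side only ever touch their local queue, which is why the adapters do not need to change.
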
At the moment, SimBricks provides two proxy implementations supporting two
protocols for network communication: TCP and RDMA. Additional proxies for other
communication protocols can easily be added.
SimBricks proxies also implement multiplexing, so that multiple connections of
component simulators between two machines can be handled by the same pair of
proxies. This reduces the number of proxies needed and therefore allows more CPU
cores to be used for simulators.
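
One common way to multiplex several logical connections over a single byte stream is to tag every message with a connection identifier. The sketch below shows this technique; the header layout and the names `mux` and `demux` are assumptions for illustration and do not reflect the actual SimBricks wire format.

```python
import struct
from collections import defaultdict

# Assumed header: 2-byte connection id followed by 4-byte payload length.
FRAME = struct.Struct("!HI")


def mux(messages) -> bytes:
    """Pack (conn_id, payload) pairs into a single byte stream."""
    return b"".join(FRAME.pack(cid, len(p)) + p for cid, p in messages)


def demux(stream: bytes) -> dict:
    """Split a byte stream back into per-connection message lists."""
    out = defaultdict(list)
    offset = 0
    while offset < len(stream):
        cid, length = FRAME.unpack_from(stream, offset)
        offset += FRAME.size
        out[cid].append(stream[offset:offset + length])
        offset += length
    return dict(out)
```

With such a scheme, a receiving proxy can route each payload to the local queue that belongs to its connection id, while message order within each connection is preserved.
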
# Orchestrating Proxies
SimBricks' [orchestration framework](orchestration_framework.html) supports
proxies and can distribute full-system simulations across multiple machines.
The user can create a distributed experiment and add
simulation components just as with a normal non-distributed simulation. Then,
the user adds appropriate proxies as needed to the experiment and finally
assigns the simulation components to the different machines. When starting the
simulation, the user provides a JSON file describing the available machines,
including each machine's IP address and working directory. The orchestration
framework then takes care of running all simulators and proxies on the
respective machines, using SSH to execute processes on remote machines.
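
To make this more concrete, the following sketch parses a machine description and builds the SSH command line for starting a process in a machine's working directory. The JSON schema (the `name`, `ip`, and `workdir` fields), the addresses, and the helper `ssh_command` are hypothetical; the actual file format expected by the orchestration framework may differ.

```python
import json
import shlex

# Hypothetical machine file; field names and values are assumptions.
HOSTS_JSON = """
[
  {"name": "host0", "ip": "10.0.0.1", "workdir": "/tmp/simbricks-work"},
  {"name": "host1", "ip": "10.0.0.2", "workdir": "/tmp/simbricks-work"}
]
"""


def ssh_command(host: dict, command: str) -> list:
    """Build an argv list that runs `command` in the host's working directory."""
    remote = f"cd {shlex.quote(host['workdir'])} && {command}"
    return ["ssh", host["ip"], remote]


hosts = json.loads(HOSTS_JSON)
```

The resulting list could be handed to `subprocess.Popen` to launch the remote process; the sketch only constructs the command and does not execute anything.
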
The orchestration framework also includes an example for automatically
distributing an experiment across two hosts, showing that this step can even be
automated.
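
One simple way to automate such a distribution is sketched below. It is not the example shipped with the orchestration framework; the greedy balancing strategy and all names are assumptions. Components that should stay on the same machine are passed as groups, each group is placed on the currently least-loaded host, and every link whose endpoints end up on different hosts is reported as needing a proxy pair.

```python
def distribute(groups, links, num_hosts=2):
    """Greedily assign component groups to hosts and find cross-host links.

    groups: list of lists of component names that must share a host
    links:  list of (component_a, component_b) connections
    """
    load = [0] * num_hosts
    assignment = {}
    # Place larger groups first so the load stays roughly balanced.
    for group in sorted(groups, key=len, reverse=True):
        target = load.index(min(load))
        for comp in group:
            assignment[comp] = target
        load[target] += len(group)
    # Links that cross a host boundary must go through proxies.
    cross = [(a, b) for a, b in links if assignment[a] != assignment[b]]
    return assignment, cross
```

For two simulated nodes, each consisting of a host simulator and a NIC simulator, plus one switch, this places the first node and the switch on one machine and the second node on the other, leaving a single link that has to cross the network.
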
If you have questions or would like to learn more: