blog: revise design-space-sweeps

0e6aab1e · Jonas Kaufmann · 8cba6750 · 0e6aab1e
Commit 0e6aab1e authored 6 months ago by Jonas Kaufmann
--- a/_posts/2024-08-28-design-space-sweeps.md
+++ b/_posts/2024-08-28-design-space-sweeps.md
 ---
 title: Realistic & Fast Design Space Sweeps
 subtitle: How SimBricks Enables the Exploration of System- and Component-Level Design Choices When a Physical Testbed is Infeasible
-date: 2024-08-29
+date: 2024-08-26
 author: jonas
 permalink: /blog/design-space-sweeps.html
 card_image: TODO
 ---
-Designing tomorrow’s heterogeneous systems is anything but straight-forward. We
-are faced with many system-level but also component-level design choices.
-Experienced system architects can immediately dismiss a bunch of configurations.
-However, as Marvin showed in [this prior blog
+Designing tomorrow’s heterogeneous systems is anything but straightforward.
+System architects face many system- and component-level design choices.
+Experienced architects can immediately dismiss some of those.
+However, as Marvin highlighted in [this prior blog
 post](https://www.simbricks.io/blog/need-for-e2e-simulation.html), even then,
-complex interactions between components often make end-to-end performance
-impossible to predict and thorough evaluation is pretty much always required.
-Unfortunately, realistic physical testbeds that have a similar scale than the
-production system we are designing are probably also infeasible.
-
-We built SimBricks for exactly this task. No matter where in the design process
-you stand, we allow you to assemble your complete system in simulation and do
-thorough, realistic end-to-end evaluation with your actual, unmodified
-workloads. Figuring out the best configuration, i.e. performing design space
-sweeps, is a first class citizen of our orchestration framework. We even support
-running them in parallel if enough compute resources are available. Let me show
-you what I mean.
-
-# Working Example & Parameters to Play With
+complex interactions between components make final end-to-end performance very
+hard to predict. This calls for thorough evaluation early on and throughout the
+design process, so that expensive mistakes are not discovered late.
+Unfortunately, building a physical testbed for evaluation takes money to buy
+the necessary components and many engineering hours to integrate all the
+pieces, only to throw everything away if the design doesn't work out. For
+large-scale systems with many components, this is completely infeasible.
+
+We built SimBricks to tackle exactly this problem. No matter where in the design
+process system architects stand, SimBricks allows them to assemble their
+complete system in simulation and do thorough end-to-end evaluation with their
+actual workloads and software. Doing design space sweeps, i.e., figuring out the
+best design choices, is a first-class citizen of our orchestration framework. We
+even support running them in parallel if enough compute resources are available.
+Let me show you what I mean.
+
+# Even Simple Systems Have a Huge Design Space

 ![Figure showing a heterogeneous system with M clients connected to an external
 network and N servers with X hardware accelerators each, which are connected to
 an internal network on the other side. There's also a load balancer in the
 internal network.](/assets/images/blog/2024-08-28-design-space-sweeps.svg)

-This is the system we are going to use as our example. We have M clients
-connected to some external network. These send requests to the load balancer in
-the internal network. Their requests are then forwarded and served by one of N
-servers with X hardware accelerators each. Here, M, N, and X are system-level
-design parameters. The external network and clients are fixed but for the
-internal network we can freely choose the topology, link speeds, etc.
-
-We also have component-level choices like the number of cores and amount of
-memory available at servers, or architectural parameters of our hardware
-accelerator, for example clock-speed and the dimensions of the internal compute
-array.
-
-To capture more realism, we are also going to add background traffic to the
-internal network. We parameterize it in its traffic volume as the percentage of
-theoretical max throughput.
-
-# Use the Full Python Machinery to Build SimBricks Experiments!
-
-In a [prior blog post](https://www.simbricks.io/blog/orchestration_framework.html), Hejing illustrates how to easily cast a system design into an experiment in the SimBricks orchestration framework. TL;DR: You just instantiate a few classes. However, there’s no restriction here that forces you to just instantiate one experiment per Python module. Instead, we can construct one for every combination of parameters that we want to evaluate. Since this is Python and we are just instantiating classes, you can use your favorite Python constructs to do so! I decided to go for `itertools.product()` and and a few simple for-loops:
+This is the system we are going to use as our working example. We have M clients
+connected to an external network. Both are given by the customer and can't be
+changed. The clients send requests to the load balancer in the internal network,
+which then forwards them to one of N servers with X hardware accelerators each.
+Here, N and X are system-level design parameters the system architect can play
+with. They can also freely choose what the internal network looks like in terms
+of topology, link speeds, etc. Further, we have component-level parameters like
+the number of cores and amount of memory available at servers, and architectural
+choices for the hardware accelerators like clock speed and the dimensions of
+their inner compute grid. Realistically, both networks are also going to have
+background traffic.
+
+Even for this rather simple system, we can already ask a bunch of questions that
+need evaluation for reliable answers: Given that the customer wants to have M
+clients, how many servers N do we need to achieve the service-level objectives,
+for example a guaranteed maximum request latency? Can we reduce the number of
+servers required by introducing hardware accelerators? How is all this
+influenced by background traffic? Can we reduce costs for building the internal
+network by prioritizing client-server traffic over background traffic with the
+help of smart network switches?
+
+# Let's Do Some Evaluation with SimBricks!
+
+To simulate the system we just saw with SimBricks, we need a simulator for each
+component. You decide the level of detail you need here! For the hardware
+accelerator, we can quickly write up a behavioral model in C++, which already
+allows us to answer what-if questions. But most importantly, we are going to run
+the actual software and workloads of our customer to measure the end-to-end
+properties we care about.
+
+For building the simulation, you write a Python script for the SimBricks
+orchestration framework that describes the system you want to simulate and which
+simulators to use. In this [prior blog
+post](https://www.simbricks.io/blog/orchestration_framework.html), Hejing
+illustrates such a script.
+
+However, there’s no restriction here that forces us to just instantiate one
+experiment per Python module. Instead, we can construct one for every
+combination of parameters that we want to evaluate. Since this is Python and we
+are just instantiating classes, feel free to use your favorite Python constructs
+to do so! I decided to go for `itertools.product()` and a few simple for-loops:

 ```python
 from simbricks.orchestration import experiments as exp
@@ -63,24 +88,24 @@ num_clients_opts = [4, 16, 128]
 num_servers_opts = [1, 2, 4, 8]
 num_accel_per_server_opts = [1, 2]
 accel_clk_freq_opts = [100, 400]
-background_load_opts = [0.5, 0.8]
+background_traffic_opts = [0.5, 0.8]

 for (
    num_clients,
    num_servers,
    num_accelerators,
    accel_clk_freq,
-    background_load,
+    background_traffic,
 ) in itertools.product(
    num_clients_opts,
    num_servers_opts,
    num_accel_per_server_opts,
    accel_clk_freq_opts,
-    background_load_opts,
+    background_traffic_opts,
 ):
    experiment = exp.Experiment(
        f"<experiment_name>-{num_servers}s-{num_clients}c-"
-        f"{num_accelerators}x-{accel_clk_freq}-{background_load}"
+        f"{num_accelerators}x-{accel_clk_freq}-{background_traffic}"
    )
 
    # Instantiate external & internal network, add background traffic
@@ -103,8 +128,17 @@ for (



-# Parallel Design Space Sweeps
+# Fast Design Space Sweeps by Running Experiments in Parallel

-In SimBricks, individual experiments can be run independently and thereby in parallel to do fast design space sweeps. Our orchestration framework even automates this for you if you invoke `simbricks-run` with the `--parallel` flag. Parallelizing on the same machine isn’t always possible though. Due to how we establish communication between simulators in the form of shared memory queues, which use active polling for maximum efficiency (learn more about this [here](https://www.simbricks.io/blog/shm-message-passing.html)), no simulator can share a physical thread with another or else simulations become very slow.
+To make design space sweeps faster, SimBricks allows you to run experiments in
+parallel. Our orchestration framework even automates this if you invoke
+`simbricks-run` with the `--parallel` flag. Parallelizing on the same machine
+isn’t always possible though. Due to how we establish communication between
+simulators with shared memory queues, which use polling for maximum efficiency
+(learn more about this
+[here](https://www.simbricks.io/blog/shm-message-passing.html)), simulators
+mustn't share physical threads, or else simulations become very slow.
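
As a rough back-of-the-envelope check, the constraint above can be turned into an estimate of how many experiments fit on one machine at once. This is only a sketch and not part of SimBricks: the helper names are hypothetical, and it assumes one simulator process per client, server, load balancer, accelerator, and network, each needing its own hardware thread.

```python
def simulators_per_experiment(
    num_clients: int, num_servers: int, accels_per_server: int
) -> int:
    """Count simulator processes, assuming one per component (an assumption)."""
    hosts = num_clients + num_servers + 1  # +1 for the load balancer
    accelerators = num_servers * accels_per_server
    networks = 2  # external and internal network
    return hosts + accelerators + networks


def max_parallel_experiments(hw_threads: int, sims_per_experiment: int) -> int:
    """Experiments that fit if every simulator gets a dedicated hardware thread."""
    return hw_threads // sims_per_experiment


sims = simulators_per_experiment(num_clients=4, num_servers=2, accels_per_server=1)
print(sims)  # 7 hosts + 2 accelerators + 2 networks = 11
print(max_parallel_experiments(hw_threads=48, sims_per_experiment=sims))  # 48 // 11 = 4
```

A result of 0 means that even a single experiment would oversubscribe the machine under these assumptions.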

-But even in this case, our orchestration framework offers distributed simulations, where simulations are run on multiple machines in parallel. Stay tuned for more about this! Until then:
+For this case, and to explore even more design choices in parallel, our
+orchestration framework offers distributed simulations, where experiments
+run on multiple machines. Stay tuned for more on this! Until then: