Distributed Systems Explained: A Basic Overview

What is a Distributed System?

A distributed system is one in which components are located on various computers connected by a network. These computers communicate by exchanging messages to coordinate their actions.
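As a rough illustration of message passing, here is a minimal sketch with two nodes modeled as in-process queues. This is an assumption made for brevity; real nodes would be separate machines exchanging these messages over a network.

```python
from queue import Queue

# Two hypothetical nodes modeled as in-process inboxes; in a real system
# they would be separate machines communicating over the network.
inbox_a: Queue = Queue()
inbox_b: Queue = Queue()

inbox_b.put("ping from A")  # node A sends a message to node B
print(inbox_b.get())        # node B receives it: "ping from A"
inbox_a.put("pong from B")  # node B replies
print(inbox_a.get())        # node A receives the reply: "pong from B"
```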

Components of a Distributed System

Nodes - The individual machines or processes that make up a distributed system

Network - The communication medium through which nodes exchange messages

Why Do We Need Distributed Systems?

Performance

There are limits to what a single node can accomplish. Each machine has hardware-based constraints. We can enhance a machine's hardware by adding more RAM and CPU, but beyond a certain point, it becomes very costly to improve the performance of a single computer.

Instead, we can achieve the same results by combining several more affordable machines.

Scalability

Most computer systems exist to store and process data. Since a single machine can only be scaled up so far, we need multiple machines to manage the vast amounts of data we encounter today; no single computer can handle all of your requests.

By using multiple machines, we can store and process data more efficiently by dividing the tasks among them.
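As a sketch of one way to divide that work, the snippet below spreads keys across machines by hashing. The node names and the hashing scheme here are assumptions chosen for illustration, not any particular product's API.

```python
import hashlib

# Hypothetical node names; in a real deployment these would be machine addresses.
NODES = ["node-0", "node-1", "node-2"]

def node_for_key(key: str) -> str:
    """Hash the key so that data is spread evenly across the machines."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for_key("user:42"))  # every key deterministically maps to one node
```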

Availability

Ensuring that a service is available 24 hours a day, 7 days a week is a significant challenge. At any moment, a single machine can fail, and when your service goes down, you lose users and revenue. Storing all of your data on one machine is also risky: if that machine crashes, you lose everything.

To achieve high availability, we need multiple machines so that if one fails, we can swiftly transition to another.
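Here is a minimal sketch of that failover idea, assuming a hypothetical set of replicas and a simulated read that fails when a replica is down:

```python
REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical replica names
DOWN = {"replica-a"}  # pretend this replica has crashed

def fetch(replica: str, key: str) -> str:
    """Simulated network read that fails when the replica is down."""
    if replica in DOWN:
        raise ConnectionError(f"{replica} is unreachable")
    return f"value of {key!r} from {replica}"

def read_with_failover(key: str) -> str:
    """Try each replica in turn; if one has failed, fall back to the next."""
    for replica in REPLICAS:
        try:
            return fetch(replica, key)
        except ConnectionError:
            continue  # this replica is down; move on to another copy
    raise RuntimeError("all replicas unavailable")

print(read_with_failover("user:42"))  # served by replica-b despite the failure
```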

Challenges in Designing Distributed Systems

Network Asynchrony

Communication networks are asynchronous: there is no way to predict how long a message will take to travel from one machine to another. As a result, messages may arrive out of order, which complicates the development of distributed systems.

To better understand this, consider an example. Imagine a user on a social media site dislikes a post, then realizes they meant to like it and changes their vote. Two messages are sent: first the dislike, then the like. Because the network is asynchronous, the like may be received and processed before the dislike. The intended outcome was for the post to be liked, but since the dislike was the last message processed, the system marked the post as disliked.
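One common way to cope with reordering (a sketch of one approach, not the only one) is to attach a sequence number at the sender and have the receiver discard any message older than what it has already applied:

```python
class Post:
    """Keeps only the newest reaction, judged by a sender-assigned sequence number."""

    def __init__(self) -> None:
        self.reaction = None
        self.last_seq = -1

    def apply(self, seq: int, reaction: str) -> None:
        if seq <= self.last_seq:
            return  # a stale message that arrived late; drop it
        self.last_seq = seq
        self.reaction = reaction

post = Post()
# The user sent "dislike" (seq 0) and then "like" (seq 1), but the network
# delivered them in the opposite order.
post.apply(1, "like")
post.apply(0, "dislike")  # ignored: older than what was already applied
print(post.reaction)      # "like" -- the user's final intent wins
```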

Partial Failures

A "partial failure" occurs when some components of your system fail. If the application does not account for this, it may lead to undesirable outcomes.

For instance, imagine your users' data is spread across multiple machines and you lose the connection to one of them. Users whose data is stored on that machine will have to wait until it comes back online.

Partial failures also complicate atomic transactions: a transaction that spans several nodes cannot commit while some of those nodes are down.
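A first line of defense is to never wait on a possibly-failed node forever. Below is a minimal sketch using a plain socket read with a timeout; the host, port, and function name are hypothetical:

```python
import socket

def read_user_data(host: str, port: int, timeout_s: float = 2.0) -> bytes:
    """Bound how long we wait on a possibly-failed node instead of hanging forever."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.settimeout(timeout_s)
            return conn.recv(4096)
    except OSError as exc:
        # The node may have crashed, or it may just be slow; from the outside
        # we cannot tell the difference, so the caller must decide what to do.
        raise ConnectionError(f"{host}:{port} did not respond in time") from exc
```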

Concurrency

Concurrency means that multiple computations happen at the same time, sometimes on the same data. This makes systems harder to reason about, because the computations can interfere with one another and produce unwanted results.
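Here is a classic race in miniature: several threads incrementing a shared counter. A lock serializes the read-modify-write step so no updates are lost. This is a minimal single-machine sketch; distributed systems face the same interference across machines, where a simple lock no longer suffices.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times: int) -> None:
    global counter
    for _ in range(times):
        with lock:  # serializes the read-modify-write so updates are not lost
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000; without the lock, updates could be lost
```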

Measuring Correctness

How can we determine whether a system is functioning correctly, or as intended? Two kinds of properties help us assess whether a system is correct or faulty:

Safety

A safety property stipulates that a certain event within the system must never occur.

For example, if we think of a bicycle as a system, a safety rule would say that the wheel must always be attached to the bike when it's moving. If the wheel comes off while the bicycle is moving, bad things might happen.
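To make this concrete in code, here is a minimal sketch of enforcing a safety invariant, using a bank account instead of a bicycle; the non-negative-balance rule is an assumption chosen purely for illustration:

```python
class Account:
    """Safety invariant: the balance must never go below zero."""

    def __init__(self, balance: int = 0) -> None:
        self.balance = balance

    def withdraw(self, amount: int) -> None:
        if amount > self.balance:
            raise ValueError("refused: withdrawal would violate the invariant")
        self.balance -= amount
        assert self.balance >= 0  # the "bad thing" is unreachable by construction
```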

Liveness

A liveness property defines an event that must eventually occur in a system.

In the case of the bicycle system, liveness might mean that the bike eventually moves when pedaled, and eventually stops when the brakes are applied.
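In code, liveness often shows up as retrying until the desired event finally occurs. Below is a hedged sketch with a simulated unreliable send; the function names and retry budget are assumptions made for illustration:

```python
import random
import time

def try_send(message: str) -> bool:
    """Simulated unreliable delivery that succeeds only some of the time."""
    return random.random() < 0.5

def eventually_deliver(message: str, max_attempts: int = 20) -> None:
    """Liveness: keep retrying so that the message is eventually delivered."""
    for attempt in range(max_attempts):
        if try_send(message):
            return  # the "good thing" finally happened
        time.sleep(min(0.01 * 2 ** attempt, 0.5))  # back off, then try again
    raise RuntimeError("retry budget exhausted; liveness was not achieved")

eventually_deliver("hello")
```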