How 'Mark and Sweep' Garbage Collection Works: A Deep Dive

If you have ever written C or C++, you are likely familiar with the double-edged sword of manual memory management. You wield complete control over every byte using malloc and free, but with that power comes the heavy responsibility of tracking every allocation. Forget to free memory? You have a leak. Free it too early? You have a dangling pointer and a potential segmentation fault.

Modern high-level languages like Java, JavaScript (via engines like V8), and Python abstract this complexity away. Developers can instantiate objects freely, trusting the runtime to clean up the mess. This process is handled by the Garbage Collector (GC), the unsung hero of modern software development.

While garbage collection feels like magic, it is strictly algorithmic. The foundation for the garbage collectors found in the JVM and V8 is a classic algorithm known as Mark and Sweep. In this article, we will demystify how this algorithm identifies dead memory, how it reclaims it, and the performance costs associated with automated hygiene.

The Core Concept: Reachability Analysis

To understand Mark and Sweep, we must first invert our thinking. The Garbage Collector does not hunt for "garbage" (dead objects) directly. Instead, it hunts for "live" objects. If an object is alive, it is kept; everything else is considered waste. This determination is made through Reachability Analysis.

Defining the 'Roots'

The analysis begins at specific starting points known as GC Roots. A GC Root is an object that is inherently accessible and must not be deleted. Common examples include:

  • Stack Variables: Local variables and parameters currently on the execution stack (in active function calls).
  • Global Variables: Objects in the global scope (e.g., the window object in browsers or static fields in Java).
  • Active Execution Contexts: Closures or timers/intervals that are currently running.

The Object Graph

Memory in a managed runtime can be visualized as a directed graph.

  • Nodes represent objects (structs, arrays, instances).
  • Edges represent references (pointers) from one object to another.

An object is considered "reachable" if there is a chain of references connecting it to a GC Root. If an object exists in the heap but has no path linking it back to a root, it is unreachable. It is essentially an island in memory, inaccessible to the running code, and therefore safe to delete.

// Simplified visualization of reachability
const root = {
  type: "Global Window",
  references: [objectA] // Edge from Root -> ObjectA
};

const objectA = {
  id: "A",
  references: [objectB] // Edge from A -> B
};

const objectB = {
  id: "B",
  references: []
};

// objectC has no incoming edges from Root, A, or B
const objectC = {
  id: "C",
  references: []
};
// Result: objectC is unreachable and will be swept.

Step-by-Step: The Mark and Sweep Algorithm

As the name implies, this algorithm operates in two distinct phases: Marking and Sweeping.

Phase 1: The Mark Phase

When the GC is triggered, it pauses the application and begins the marking phase. The collector assumes all objects are garbage (unmarked) until proven otherwise.

  1. The collector starts traversing the object graph from the GC Roots.
  2. It uses graph traversal algorithms (typically Depth-First Search or Breadth-First Search) to visit every object connected to a root.
  3. Every object visited has a specific "mark bit" in its header set to true (or 1).
  4. If the traversal encounters an object that is already marked, it stops moving down that path to prevent infinite loops.

By the end of this phase, every object reachable by the application has been flagged as "live."

Phase 2: The Sweep Phase

Once the traversal is complete, the Sweep phase begins. The GC scans the entire heap memory linearly, from start to finish.

  1. It inspects the mark bit of every object.
  2. If the bit is True (Marked): The object is alive. The GC resets the bit to false to prepare for the next cycle.
  3. If the bit is False (Unmarked): The object is unreachable. The GC reclaims this memory, adding the address range back to the "free list" so it can be used for future allocations.

Handling Cycles

One of the primary advantages of Mark and Sweep over older methods, like Reference Counting, is its ability to handle circular references.

Consider two objects, A and B, that reference each other but are no longer referenced by the main application.

  • Reference Counting: Would see that A has a reference count of 1 (from B) and B has a count of 1 (from A), preventing them from ever being deleted.
  • Mark and Sweep: Since neither A nor B can be reached from a GC Root, the Mark phase never visits them. They remain unmarked and are successfully swept away.
// The cycle problem solved by Mark and Sweep
function createCycle() {
  const obj1 = {};
  const obj2 = {};
  
  // Circular reference
  obj1.a = obj2;
  obj2.a = obj1;
  
  return "Finished";
}

createCycle();
// Once the function exits, obj1 and obj2 are still referencing each other.
// However, they are detached from the Root (stack).
// Mark and Sweep correctly identifies them as unreachable.

The Cost of Cleanliness: Performance Trade-offs

While Mark and Sweep ensures memory safety, it is not free. There are two significant costs that engineers must be aware of.

Stop-the-World (STW) Pauses

The most infamous side effect of garbage collection is the "Stop-the-World" event. To safely traverse the graph and mark objects, the GC must guarantee that the object graph doesn't change during the process. If the application were allowed to create new objects or change pointers while the GC was marking, the collector could miss live objects or delete data that was just allocated.

Consequently, the application execution is completely paused. In older runtime versions, this could cause noticeable stuttering or latency spikes in the application.

Memory Fragmentation

The Sweep phase reclaims memory, but it doesn't necessarily organize it. Imagine a block of memory like a row of houses. If you demolish every third house, you have plenty of empty lots, but you might not have enough contiguous space to build a shopping mall.

This is fragmentation. After a sweep, the heap is often riddled with gaps (free memory) scattered between live objects. If the application tries to allocate a large array that requires a contiguous block of memory, the allocation might fail even if the total free memory is sufficient. To solve this, modern GCs often include a Compact phase (Mark-Sweep-Compact), which shifts live objects together to close the gaps, though this adds significant processing overhead.

Modern Optimizations: Generational Garbage Collection

To mitigate the costs of Stop-the-World pauses and frequent scanning, modern engines like V8 (Chrome/Node.js) and HotSpot (JVM) utilize Generational Garbage Collection.

The Weak Generational Hypothesis

Empirical observation of software behavior led to the "Weak Generational Hypothesis," which states:

Most objects die young.

In a typical application, temporary variables, loop iterators, and request-response objects are created and discarded almost immediately. Very few objects (like application configuration or heavy caches) stay alive for the duration of the process.

Young Generation (Eden) vs. Old Generation

Based on this hypothesis, the heap is divided into two main areas:

  1. Young Generation (Eden Space): New objects are allocated here. Because most die quickly, this space fills up fast.
  2. Old Generation: Objects that survive multiple GC cycles in the Young Generation are "promoted" to the Old Generation.

Minor vs. Major GC

This separation allows the GC to run extremely efficient Minor GCs only on the Young Generation. Since most objects in Eden are dead, the Mark phase is very fast (it only marks a handful of survivors), and the survivors are quickly moved out. This keeps STW pauses negligible.

A Major GC (Full Mark and Sweep) is triggered only when the Old Generation fills up. These are the expensive, longer pauses, but thanks to the generational split, they occur much less frequently.

Conclusion

The transition from manual memory management to automatic Garbage Collection has revolutionized developer productivity. While the underlying mechanics of Mark and Sweep—tracing roots, marking nodes, and sweeping the heap—are complex, understanding them is crucial for backend and frontend performance.

By understanding the distinction between the Mark and Sweep phases, the inevitability of Stop-the-World pauses, and the efficiency of Generational GC, you can write code that is friendlier to the engine—avoiding memory leaks and reducing the frequency of costly Major GC cycles.

Want to see algorithms in action? Check out our Text Transformer tool to clean up your own data strings instantly.