The WWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104


This document synthesizes the extensive work performed from March
13th to March 20th, 2026, to harden, stabilize, and refactor the
WWW::Mechanize::Chrome library and its test suite. This
effort involved deep dives into asynchronous programming,
platform-specific bug hunting, and strategic architectural
decisions.


Part I: The Quest for Cross-Platform Stability (March 13 – 16)

The initial phase of work focused on achieving a “green” test suite
across a variety of Linux distributions and preparing for a new release.
This involved significant hardening of the library to account for
different browser versions, OS-level security restrictions, and
filesystem differences.

Key Milestones & Engineering Decisions:

  • Fedora & RHEL-family Success: A major effort
    was undertaken to achieve a 100% pass rate on modern Fedora 43 and
    CentOS Stream 10. This required several key engineering decisions to
    handle modern browser behavior:

    • Decision: Implement Asynchronous DOM Serialization
      Fallback.
      Synchronous fallbacks in an async context are
      dangerous. To prevent “Resource was not cached” errors during
      saveResources, we implemented a fully asynchronous fallback
      in _saveResourceTree. By chaining
      _cached_document with DOM.getOuterHTML
      messages, we can reconstruct document content without blocking the
      event loop, even if Chromium has evicted the resource from its
      cache. This also proved resilient against Fedora’s security
      policies, which often block file:// access.
    • Decision: Truncate Filenames for Cross-Platform
      Safety.
      To avoid “File name too long” errors,
      especially on Windows where the MAX_PATH limit is 260
      characters, filenameFromUrl was hardened. The filename
      truncation limit was reduced to a more conservative 150
      characters, leaving ample headroom for deeply nested CI
      temporary directories. Logic was also added to preserve file
      extensions during truncation and to sanitize backslashes from URI
      paths.
    • Decision: Expand Browser Discovery Paths. To
      support RHEL-based systems out-of-the-box, the
      default_executable_names was expanded to include
      headless_shell and search paths were updated to include
      /usr/lib64/chromium-browser/.
    • Decision: Mitigate Race Conditions with Stabilization Waits
      and Resilient Fetching.
      On fast systems,
      DOM.documentUpdated events could invalidate
      nodeIds immediately after navigation, causing XPath queries
      to fail with “Could not find node with given id”. A small stabilization
      sleep(0.25s) was added after page loads to ensure the DOM
      is settled. Furthermore, the asynchronous DOM fetching loop was hardened
      to gracefully handle these errors by catching protocol errors and
      returning an empty string for any node that was invalidated during
      serialization, ensuring the overall process could complete.
  • Windows Hardening:
    • Decision: Adopt Platform-Aware Watchdogs. The test
      suite’s reliance on ualarm was a blocker for Windows, where
      it is not implemented. The t::helper::set_watchdog function
      was refactored to use standard alarm() (seconds) on Windows
      and ualarm (microseconds) on Unix-like systems, enabling
      consistent test-level timeout enforcement.
  • Version 0.77 Release:
    • Decision: Adopt SOP for Version Synchronization.
      The project maintains duplicate version strings across 24+ files. A
      Standard Operating Procedure was adopted to use a batch-replacement tool
      to update all sub-modules in lib/ and to always run
      make clean and perl Makefile.PL to ensure
      META.json and META.yml reflect the new
      version. After achieving stability on Linux, the project version was
      bumped to 0.77.
  • Infrastructure & Strategic Work:
    • The ad2 Windows Server 2025 instance was restored and
      optimized, with Active Directory demoted and disk I/O performance
      improved.
    • A strategic proposal for the Heterogeneous Directory
      Replication Protocol (HDRP)
      was drafted and published.
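
The extension-preserving truncation described above can be sketched as
follows. This is an illustrative helper only: the function name, the
extension regex, and the exact handling of backslashes are assumptions,
not the library’s actual filenameFromUrl code.

    # Illustrative sketch only -- not the actual filenameFromUrl logic;
    # helper name, regex, and backslash handling are assumptions
    sub truncate_filename {
        my ($name, $limit) = @_;
        $limit //= 150;              # conservative cross-platform cap
        $name =~ tr{\\}{_};          # sanitize backslashes from URI paths
        return $name if length($name) <= $limit;
        # Split off a short trailing extension so it survives truncation
        my ($base, $ext) = $name =~ /\A(.*?)((?:\.[A-Za-z0-9]{1,8})?)\z/s;
        my $keep = $limit - length($ext);
        return substr($base, 0, $keep) . $ext;
    }

Under this sketch, a 300-character name ending in .html keeps its
.html suffix while the result stays within the 150-character cap.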

Part II: The Great Async Refactor (March 17 – 18)

Despite success on Linux, tests on the slow ad2 Windows
host were still plagued by intermittent, indefinite hangs. This
triggered a fundamental architectural shift to move the library’s core
from a mix of synchronous and asynchronous code to a fully non-blocking
internal API.

Key Milestones & Engineering Decisions:

  • Decision: Expose a _future API.
    Instead of hardcoding timeouts in the library, the core strategy was to
    refactor all blocking methods (xpath, field,
    get, etc.) into thin wrappers around new non-blocking
    ..._future counterparts. This moved timeout management to
    the test harness, allowing for flexible and explicit handling of
    stalls.

    # Example library implementation
    sub xpath($self, $query, %options) {
        return $self->xpath_future($query, %options)->get;
    }
    
    sub xpath_future($self, $query, %options) {
        # Async implementation using $self->target->send_message(...)
    }
  • Decision: Centralize Test Hardening in a Helper.
    A dedicated test library, t/lib/t/helper.pm, was created to
    contain all stabilization logic. “Safe” wrappers (safe_get,
    safe_xpath) were implemented there, using
    Future->wait_any to race asynchronous operations against
    a timeout, preventing tests from hanging.

    # Example test helper implementation
    sub safe_xpath {
        my ($mech, $query, %options) = @_;
        my $timeout = delete $options{timeout} || 5;
        my $call_f = $mech->xpath_future($query, %options);
        my $timeout_f = $mech->sleep_future($timeout)->then(sub { Future->fail("Timeout") });
        return Future->wait_any($call_f, $timeout_f)->get;
    }
  • Decision: Refactor Node Attribute Cache.
    Investigations into flaky checkbox tests (t/50-tick.t)
    revealed that WWW::Mechanize::Chrome::Node was storing
    attributes as a flat list ([key, val, key, val]), which was
    inefficient for lookups and individual updates. The cache was
    refactored to use a HashRef, providing O(1) lookups
    and enabling atomic dual-updates in which the browser property (via
    JS) and the internal library attribute are updated
    together.

  • Decision: Implement Self-Cancelling Socket
    Watchdog.
    On Windows, traditional watchdog processes often
    failed to detect parent termination, leading to 60-second hangs after
    successful tests. We implemented a new socket-based watchdog in
    t::helper that listens on an ephemeral port; the background
    process terminates immediately when the parent socket closes,
    eliminating these cumulative delays.

  • Decision: Deep Recursive Refactoring & Form
    Selection.
    To make the API truly non-blocking, the entire
    internal call stack had to be refactored. For example, making
    get_set_value_future non-blocking required first making its
    dependency, _field_by_name, asynchronous. This culminated
    in refactoring the entire form selection API (form_name,
    form_id, etc.) to use the new asynchronous
    _future lookups, which was a key step in mitigating the
    Windows deadlocks.

  • Decision: Fix Critical Regressions & Memory
    Cycles.

    • Evaluation Normalization: Implemented a
      _process_eval_result helper to centralize the parsing of
      results from Runtime.evaluate. This ensures consistent
      handling of return values and exceptions between synchronous
      (eval_in_page) and asynchronous (eval_future)
      calls.

    • Memory Cycle Mitigation: A significant memory
      leak was discovered where closures attached to CDP event futures (like
      for asynchronous body retrieval) would capture strong references to
      $self and the $response object, creating a
      circular reference. The established rule is to now always use
      Scalar::Util::weaken on both $self and any
      other relevant objects before they are used inside a
      ->then block that is stored on an object.

    • Context Propagation (wantarray): A
      major regression was discovered where Perl’s wantarray
      context, which distinguishes between scalar and list context, was lost
      inside asynchronous Future->then blocks. This caused
      methods like xpath to return incorrect results (e.g., a
      count instead of a list of nodes). The solution was to adopt the “Async
      Context Pattern”: capture wantarray in the synchronous
      wrapper, pass it as an option to the _future method, and
      then use that captured value inside the future’s final resolution
      block.

      # Synchronous Wrapper
      sub xpath($self, $query, %options) {
          $options{ wantarray } = wantarray; # 1. Capture
          return $self->xpath_future($query, %options)->get; # 2. Pass
      }
      
      # Asynchronous Implementation
      sub xpath_future($self, $query, %options) {
          my $wantarray = delete $options{ wantarray }; # 3. Retrieve
          # ... async logic ...
          return $doc->then(sub {
              if ($wantarray) { # 4. Respect
                  return Future->done(@results);
              } else {
                  return Future->done($results[0]);
              }
          });
      }
    • Asynchronous Body Retrieval & Robust Content
      Fallbacks:
      Fixed a bug where decoded_content()
      would return empty strings by ensuring it awaited a
      __body_future. This was implemented by storing the
      retrieval future directly on the response object
      ($response->{__body_future}). To make this more robust,
      a tiered strategy was implemented: first try to get the content from the
      network response, but if that fails (e.g., for about:blank
      or due to cache eviction), fall back to a JavaScript
      XMLSerializer to get the live DOM content.

    • Signature Hardening: Fixed “Too few arguments”
      errors when using modern Perl signatures with
      Future->then. Callbacks were updated to use optional
      parameters (sub($result = undef) { ... }) to gracefully
      handle futures that resolve with no value.

    • XHTML “Split-Brain” Bug: Resolved a
      long-standing Chromium bug (40130141) where content provided via
      setDocumentContent is parsed differently than content
      loaded from a URL. A workaround was implemented: for XHTML documents,
      WMC now uses a JavaScript-based XPath evaluation
      (document.evaluate) against the live DOM, bypassing the
      broken CDP search mechanism.
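
The weaken rule above, applied to the stored body future, looks roughly
like the sketch below. The method name, the response field names, and
the exact CDP call are simplified assumptions for illustration; only
the weaken-before-closure pattern itself is the point.

    # Sketch of the weaken-before-closure rule; names and fields are
    # illustrative, not the library's actual implementation
    use Future;
    use Scalar::Util 'weaken';

    sub _fetch_body_future {
        my ($self, $response) = @_;
        my $weak_self     = $self;      weaken $weak_self;
        my $weak_response = $response;  weaken $weak_response;
        # Storing the future on the response would otherwise create a
        # $response -> future -> closure -> $response cycle
        $response->{__body_future} = $self->target->send_message(
            'Network.getResponseBody',
            requestId => $response->{requestId},   # field name assumed
        )->then(sub {
            my ($result) = @_;
            # Bail out quietly if either object was destroyed meanwhile
            return Future->done unless $weak_self && $weak_response;
            $weak_response->content( $result->{body} );
            Future->done( $weak_response );
        });
    }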

Derived Architectural Rules & SOPs:

  • Rule: Always provide _future variants.
    Every library method that interacts with the browser via CDP must have a
    non-blocking asynchronous counterpart.
  • Rule: Centralize stabilization in the test layer.
    All timeout and retry logic should reside in the test harness
    (t/lib/t/helper.pm), not in the core library.
  • Rule: Explicitly propagate wantarray
    context.
    Synchronous wrappers must capture the caller’s context
    and pass it down the Future chain to ensure correct
    scalar/list behavior.
  • Rule: The entire call chain must be asynchronous.
    To enable non-blocking timeouts, even a single “hidden” blocking call in
    an otherwise asynchronous method will cause a stall.
  • SOP: Reduce Library Noise. Diagnostic messages
    (warn, note, diag) should not be
    left in library code at commit time. All such messages should be
    converted to the internal $self->log('debug', ...)
    mechanism, ensuring clean TAP output for CI systems.
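
The noise-reduction SOP amounts to a mechanical substitution; the
message text here is invented for illustration:

    # Before: ad-hoc diagnostics pollute TAP output
    warn "DOM not yet stable, retrying";

    # After: routed through the library's logging mechanism
    $self->log('debug', "DOM not yet stable, retrying");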

Part III: The MutationObserver Saga (March 19)

With most of the library refactored to be asynchronous, one stubborn
test, t/65-is_visible.t, continued to fail with timeouts.
This led to an ambitious, but ultimately unsuccessful, attempt to
replace the wait_until_visible polling logic with a more
“modern” MutationObserver.

Key Milestones & Challenges:

  • The Theory: The goal was to replace an inefficient
    repeat { sleep } loop with an event-driven
    MutationObserver in JavaScript that would notify Perl
    immediately when an element’s visibility changed.
  • Implementation & Cascade Failure: The
    implementation proved incredibly difficult and introduced a series of
    new, hard-to-diagnose bugs:

    1. An incorrect function signature for
      callFunctionOn_future.
    2. A critical unit mismatch, passing seconds from Perl to JavaScript’s
      setTimeout, which expected milliseconds.
    3. A fundamental hang where the MutationObserver’s
      JavaScript Promise would never resolve, even after the
      underlying DOM element changed.
  • Debugging Maze: Multiple attempts to fix the
    checkVisibility JavaScript logic inside the observer
    callback, including making it more robust by adding DOM tree traversal
    and extensive console.log tracing, failed to resolve the
    hang. This highlighted the opacity and difficulty of debugging complex,
    cross-language asynchronous interactions, especially when dealing with
    low-level browser APIs.
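
In outline, the attempted (and ultimately abandoned) design looked like
the sketch below. Both the JavaScript and the
callFunctionOn_future argument shapes are assumptions for
illustration; the real code differed and, crucially, its Promise never
resolved.

    # Outline of the abandoned event-driven wait; argument shapes and
    # JavaScript are assumptions, not the code that shipped
    my $js = <<~'JS';
        function(timeout_ms) {
            var el = this;
            return new Promise(function(resolve, reject) {
                var obs = new MutationObserver(function() {
                    if (el.checkVisibility()) { obs.disconnect(); resolve(true); }
                });
                obs.observe(document,
                    { attributes: true, childList: true, subtree: true });
                // setTimeout takes milliseconds -- the unit the Perl
                // side initially got wrong
                setTimeout(function() { obs.disconnect(); reject('timeout'); },
                           timeout_ms);
            });
        }
        JS
    my $f = $mech->callFunctionOn_future( $js,
        objectId     => $node_object_id,
        arguments    => [ { value => $timeout_s * 1000 } ],  # seconds -> ms
        awaitPromise => 1,
    );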

Procedural Learning: Granular Edits

The effort was plagued by procedural missteps in using automated
file-editing tools. Initial attempts to replace large code blocks in a
single operation led to accidental code loss and match failures.

  • Decision: Adopt “Delete, then Add” Workflow.
    Following forceful user correction, a new SOP was established for all
    future modifications:

    1. Isolate: Break the file into small, manageable
      chunks (e.g., 250 lines).
    2. Delete: Perform a “delete” operation by replacing
      the old code block with an empty string.
    3. Add: Perform an “add” operation by inserting the
      new code into the empty space.
    4. Verify: Confirm each atomic step before
      proceeding. This granular process, while slower, ensured surgical
      precision and regained technical control over the large
      Chrome.pm module.

The consistent failure of the MutationObserver approach
eventually led to the decision to abandon it in favor of stabilizing the
original, more transparent implementation.


Part IV: Reversion and Final Stabilization (March 20)

After exhausting all reasonable attempts to fix the
MutationObserver, a strategic decision was made to revert
to the simpler, more transparent polling implementation and fix it
correctly. This proved to be the correct path to a stable solution.

Key Milestones & Engineering Decisions:

  • Decision: Perform Strategic Reversion. The
    MutationObserver implementation, when integrated via
    callFunctionOn_future with awaitPromise,
    proved fundamentally unstable. Its JavaScript promise would consistently
    fail to resolve, causing indefinite hangs. A decision was made to
    revert all MutationObserver code from
    WWW::Mechanize::Chrome.pm and restore the original
    repeat { sleep } polling mechanism. A stable,
    understandable solution was prioritized over an elegant but broken
    one.
  • Decision: Correct Timeout Delegation in the
    Harness.
    The root cause of the original timeout failure was
    identified as a race condition in the t/lib/t/helper.pm
    test harness. The safe_wait_until_* wrappers were
    implementing their own timeout (via wait_any and
    sleep_future) that raced against the underlying polling
    function’s internal timeout. This led to intermittent failures on slow
    machines. The helpers were refactored to delegate all timeout
    management to the library’s polling functions, ensuring a
    single, authoritative timer controlled the operation.
  • Decision: Optimize Polling Performance. At the
    user’s request, the polling interval was reduced from 300ms to
    150ms. This modest performance improvement reduced the
    test suite’s wallclock execution time by over a second while maintaining
    stability.
  • Decision: Tune Test Watchdogs. The global watchdog
    timeout was adjusted to 12 seconds, specifically calculated as 1.5x the
    observed real execution time of the optimized test. This provides a
    data-driven safety margin for CI.
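
After the reversion, the corrected helper shape is roughly the
following; the option names passed through are assumed, and the point
is simply that no wait_any/sleep_future race remains:

    # Corrected helper: delegate the timeout to the library's own poller
    # (option names assumed for illustration)
    sub safe_wait_until_visible {
        my ($mech, %options) = @_;
        my $timeout = delete $options{timeout} || 5;
        # No competing timer here: the polling function's internal
        # timeout is the single authoritative one
        return $mech->wait_until_visible( %options, timeout => $timeout );
    }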

Part V: The Last Bug – A Platform-Specific Memory Leak (March 20)

With all other tests passing, a single memory leak failure in
t/78-memleak.t persisted, but only on the Windows
ad2 environment. This required a different approach than
the timeout fixes.

Key Milestones:

  • The Bug: A strong reference cycle involving the
    on_dialog event listener was not being broken on Windows,
    despite multiple attempts to fix it. Fixes that worked on Linux (such as
    calling on_dialog(undef) in DESTROY) were not
    sufficient on the Windows host.
  • The Diagnosis: The issue was determined to be a
    deep, platform-specific interaction between Perl’s garbage collector,
    the IO::Async event loop implementation on Windows, and the
    Test::Memory::Cycle module. The cycle report was identical
    on both platforms, but the cleanup behavior was different.
  • Failed Attempts: A series of increasingly
    aggressive fixes were attempted to break the cycle, including:

    1. Moving the on_dialog(undef) call from
      close() to DESTROY().
    2. Explicitly deleting the listener and callback
      properties from the object hash in DESTROY.
    3. Swapping between $self->remove_listener and
      $self->target->unlisten in a mistaken attempt to find
      the correct un-registration method.
  • Pragmatic Solution: After exhausting all reasonable
    code-level fixes without a resolution on Windows, the user opted to mark
    the failing test as a known issue for that specific platform.
  • Final Fix: The single failing test in
    t/78-memleak.t was wrapped in a conditional
    TODO block that only executes on Windows
    (if ($^O =~ /MSWin32/i)), formally acknowledging the bug
    without blocking the build. This allows the test suite to pass in CI
    environments while flagging the issue for future, deeper
    investigation.
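
The platform-conditional TODO follows the stock Test::More idiom. A
minimal sketch, assuming the leak check uses
Test::Memory::Cycle’s memory_cycle_ok:

    # Known platform-specific failure: flag it, don't block the build
    use Test::More;
    use Test::Memory::Cycle;

    TODO: {
        local $TODO = $^O =~ /MSWin32/i
            ? 'known on_dialog reference cycle on Windows'
            : undef;
        memory_cycle_ok( $mech, 'no leftover reference cycles' );
    }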

Part VI: CI Hardening (March 20)

A final failure in the GitHub Actions CI environment revealed one
last configuration flaw.

Key Milestones:

  • The Bug: The CI was running
    prove --nocount --jobs 3 -I local/ -bl xt t directly. This
    command was missing the crucial -It/lib include path, which
    is necessary for test files to locate the t::helper module.
    This resulted in nearly all tests failing with
    Can't locate t/helper.pm in @INC.
  • The Investigation: An analysis of
    Makefile.PL revealed a custom MY::test block
    specifically designed to inject the -It/lib flag into the
    make test command. This confirmed that
    make test is the correct, canonical way to run the test
    suite for this project.
  • The Fix: The
    .github/workflows/linux.yml file was modified to replace
    the direct prove call with make test in the
    Run Tests step. This ensures the CI environment runs the
    tests in the exact same way as a local developer, with all necessary
    include paths correctly configured by the project’s build system.
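
The MakeMaker hook that injects the flag can be expressed via the
documented test_via_harness override; this is an outline
only, and the project’s actual MY::test block may differ in
detail:

    # Outline of a Makefile.PL override injecting -It/lib into
    # `make test`; the project's actual MY::test block may differ
    package MY;

    sub test_via_harness {
        my ($self, $perl, $tests) = @_;
        # Prepend -It/lib so test files can locate t::helper
        return $self->SUPER::test_via_harness( "$perl -It/lib", $tests );
    }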

Final Outcome

After this long and arduous journey, the
WWW::Mechanize::Chrome test suite is now stable and
passing on all targeted platforms, with known
platform-specific issues clearly documented in the code. The project is
in a vastly more robust and reliable state.

