How We Built Real-Time Collaboration Into Our Document Editor

When we started building Quixli, real-time collaboration wasn't on the roadmap. The initial product was focused on a different problem — creating richly formatted documents and sharing them with people outside your team, with controls like PIN protection, expiration dates, and view analytics. The editor needed to be powerful. The sharing needed to be secure. Collaboration was a "nice to have."

That changed about three months in, when our early users started asking the same question in different ways: "Can my teammate and I work on this at the same time?" The first time, we said "not yet." The fifth time, we realized that even in a product focused on external sharing, the creation process is almost always collaborative. Someone drafts, someone reviews, someone adds the pricing section. If that workflow requires copy-pasting between documents or waiting for one person to finish before the next person can start, you've introduced friction into the very workflow the product is supposed to streamline.

So we built real-time collaboration. This is the story of how — the technical decisions we made, the tradeoffs we accepted, and the things we got wrong before we got them right.

The First Decision: OT or CRDT

If you've spent any time researching collaborative editing, you've encountered the two dominant approaches to conflict resolution: Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDT). The choice between them is the first architectural fork in the road, and it shapes everything that follows.

The core problem both approaches solve is deceptively simple to state: two users edit the same document at the same time. User A inserts the word "hello" at position 5. User B deletes the character at position 3. By the time each user's change reaches the other, the document state has already moved. If you naively apply the operations as they arrive, the positions are wrong, characters appear in the wrong places, and the document diverges into two incompatible states. The goal is eventual consistency — both users should end up seeing the same document, regardless of the order their operations arrive.

OT solves this by transforming operations against each other. When User A's insert arrives at User B's client, the system recognizes that User B has already deleted a character before position 5, so the insert position is shifted to 4. User B's delete at position 3 needs no adjustment on User A's side, because the insert landed after it. A central server acts as the single source of truth, ordering operations and broadcasting the transformed results to all clients. This is how Google Docs works, and it's been battle-tested at massive scale for over a decade.
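A minimal sketch of the transformation idea, using the insert and delete from the example above on a flat string. The names `Insert`, `Delete`, `apply_op`, and `transform` are illustrative assumptions rather than Quixli's code, and the sketch only covers an insert crossing a single-character delete; same-type pairs need tie-breaking rules that are left out here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int  # removes the single character at this index


def apply_op(doc, op):
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op, applied):
    """Adjust `op` so it can run after the concurrent op `applied`."""
    if isinstance(op, Insert) and isinstance(applied, Delete):
        # A character removed before the insertion point pulls it left by one.
        return replace(op, pos=op.pos - 1) if applied.pos < op.pos else op
    if isinstance(op, Delete) and isinstance(applied, Insert):
        # Text inserted at or before the target character pushes it right.
        if applied.pos <= op.pos:
            return replace(op, pos=op.pos + len(applied.text))
        return op
    raise NotImplementedError("same-type pairs are out of scope for this sketch")


base = "abcdefgh"
op_a = Insert(5, "hello")  # User A
op_b = Delete(3)           # User B

# Each site applies its own op first, then the transformed remote op.
site_a = apply_op(apply_op(base, op_a), transform(op_b, op_a))
site_b = apply_op(apply_op(base, op_b), transform(op_a, op_b))

# Applying A's insert at B's site without transforming lands one character late.
naive_b = apply_op(apply_op(base, op_b), op_a)
```

Both sites end up with "abcehellofgh", while the untransformed path at User B's site produces "abcefhellogh", which is exactly the divergence described earlier.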

CRDT takes a different approach. Instead of transforming operations, CRDTs change the data structure itself so that concurrent operations are commutative: they produce the same result regardless of the order in which they're applied. In a typical sequence CRDT, each character gets a unique identifier that fixes its place relative to its neighbors (in some designs, a fractional position between them), so insertions and deletions never conflict at the data level. CRDTs can work peer-to-peer without a central server, which makes them attractive for offline-first applications.
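To make the commutativity concrete, here is a deliberately naive sketch of a fractional-position sequence CRDT. The `Replica` class and its identifier layout (position, site, per-site counter) are assumptions for illustration only; production CRDTs use more careful identifier schemes, and this one breaks down when two sites repeatedly insert at the same spot.

```python
from fractions import Fraction


class Replica:
    def __init__(self, site):
        self.site = site
        self.seq = 0          # per-site counter that keeps identifiers unique
        self.chars = {}       # identifier -> character
        self.removed = set()  # identifiers of deleted characters

    def apply(self, op):
        kind, char_id, ch = op
        if kind == "ins":
            self.chars[char_id] = ch
        else:
            self.removed.add(char_id)

    def visible_ids(self):
        return sorted(i for i in self.chars if i not in self.removed)

    def text(self):
        return "".join(self.chars[i] for i in self.visible_ids())

    def insert(self, index, ch):
        ids = self.visible_ids()
        left = ids[index - 1][0] if index > 0 else Fraction(0)
        right = ids[index][0] if index < len(ids) else Fraction(1)
        self.seq += 1
        op = ("ins", ((left + right) / 2, self.site, self.seq), ch)
        self.apply(op)
        return op

    def delete(self, index):
        op = ("del", self.visible_ids()[index], None)
        self.apply(op)
        return op


r1, r2, r3 = Replica("A"), Replica("B"), Replica("C")

setup = [r1.insert(0, "a"), r1.insert(1, "c")]  # r1 now reads "ac"
for op in setup:
    r2.apply(op)

insert_b = r1.insert(1, "b")  # concurrent edit on r1: "abc"
delete_a = r2.delete(0)       # concurrent edit on r2: "c"

r1.apply(delete_a)
r2.apply(insert_b)

# A third replica receives everything in a scrambled order,
# including the delete before the insert it refers to.
for op in [delete_a, insert_b] + setup[::-1]:
    r3.apply(op)
```

All three replicas read "bc" even though they saw the operations in different orders. Note that `removed` only ever grows; that is the bookkeeping cost this style of CRDT pays for order independence.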

We chose a server-authoritative OT-based approach, and here's why.

First, Quixli's architecture already has a server. Every document lives on our backend. Sharing links, access controls, view analytics, version history — all of these features require a server. The peer-to-peer advantage of CRDTs wasn't relevant to our architecture.

Second, our editor supports rich text — not plain text. This is the critical distinction that most OT-vs-CRDT blog posts gloss over. Rich text editing involves operations on a tree structure (paragraphs containing inline elements containing text nodes), not a flat character sequence. OT operations like "split this paragraph at position 12" or "apply bold to characters 5 through 15" carry semantic intent that maps naturally to how editors work internally. CRDT operations at the character level lose that higher-level intent, and reconstructing it adds complexity that we didn't want to take on.

Third, CRDTs come with memory overhead. Tombstone-based CRDTs (like WOOT) never truly delete characters — they mark them as invisible, and the document's internal representation grows monotonically over the session's lifetime. For long-lived documents with many edits, this produces measurable performance degradation. Non-tombstone approaches (like Logoot) avoid this but introduce their own complexity around identifier allocation. OT with a central server keeps the document representation clean — what you see is what's stored.

The tradeoff we accepted: OT requires a server connection. If the server goes down, real-time collaboration stops. We mitigate this with local buffering (operations queue locally during brief disconnections and replay when the connection resumes), but we can't support true offline editing with automatic merge. For our use case — teams co-editing documents in real-time before sharing them externally — this tradeoff is acceptable. Our users are online when they're collaborating; the offline scenario is handled by a single-user editing path with auto-save and version conflict detection on reconnect.

WebSocket Architecture and Operation Flow

With OT chosen, the next question was how to get operations from clients to the server and back in real-time. The answer, unsurprisingly, is WebSockets — but the details of how you structure the WebSocket communication matter more than the choice of transport.

Our collaboration service runs as a stateful process that maintains an in-memory representation of each active document. When a user opens a document for editing, the client establishes a WebSocket connection to the collaboration service, which loads the document into memory (if it's not already there from another connected user) and sends the current state to the client.

The operation flow works like this. When User A types a character, the editor produces an operation (say, "insert 'x' at position 47 in paragraph 3"). This operation is applied immediately to User A's local document — the latency between keystroke and visual feedback must be zero, or the editor feels broken. The operation is then sent to the server over the WebSocket connection.

The server receives the operation, checks it against the current server state, transforms it if necessary against any operations that were applied between User A's last acknowledged state and the current state, applies the transformed operation to the authoritative document, and broadcasts it to all other connected clients. When User B receives this operation, their client transforms it against any local operations that User B has made but the server hasn't yet acknowledged, and applies the result to User B's local document.
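The server half of that flow can be sketched as follows, assuming single-character operations on a flat string and a revision counter equal to the number of operations the server has applied. `Op`, `DocumentServer`, and `receive` are hypothetical names; the client-side transform against unacknowledged local operations is not shown.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Op:
    kind: str     # "ins" or "del"
    pos: int
    ch: str = ""  # inserted character; unused for "del"


def apply_op(text, op):
    if op.kind == "ins":
        return text[:op.pos] + op.ch + text[op.pos:]
    return text[:op.pos] + text[op.pos + 1:]


def transform(op, prior) -> Optional[Op]:
    """Rebase `op` over `prior`, which the server has already applied."""
    if prior.kind == "ins":
        # Ties go to the already-applied insert, so the newcomer shifts right.
        return replace(op, pos=op.pos + 1) if prior.pos <= op.pos else op
    if prior.pos < op.pos:
        return replace(op, pos=op.pos - 1)
    if prior.pos == op.pos and op.kind == "del":
        return None  # both users deleted the same character
    return op


class DocumentServer:
    def __init__(self, text):
        self.text = text
        self.history: List[Op] = []  # ops in the order the server applied them

    @property
    def revision(self):
        return len(self.history)

    def receive(self, op, base_revision):
        """Transform, apply, and return the op to broadcast (None if dropped)."""
        for prior in self.history[base_revision:]:
            op = transform(op, prior)
            if op is None:
                return None
        self.text = apply_op(self.text, op)
        self.history.append(op)
        return op


server = DocumentServer("abcdefgh")
from_b = server.receive(Op("del", 3), base_revision=0)
# User A sent this before seeing User B's delete, so it is still based on revision 0.
from_a = server.receive(Op("ins", 5, "x"), base_revision=0)
```

The server rebases User A's insert from position 5 to position 4 before applying and broadcasting it, and the authoritative text becomes "abcexfgh".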

This is the classic OT control algorithm, and the implementation details are well-documented in the academic literature. The engineering challenge isn't the algorithm itself — it's everything around it.

Connection management was our first real headache. WebSocket connections drop. They drop because of network switches, because of laptop sleep/wake cycles, because of mobile browsers background-throttling tabs, and because of load balancer timeouts. Every disconnection is a potential consistency break — the client has a local document state that may have diverged from the server. We handle this with a reconnection protocol that sends the client's last acknowledged operation version on reconnect. The server then sends back all operations that occurred after that version, and the client replays them to catch up. If the gap is too large (the user was disconnected for a long time with significant changes on both sides), we fall back to a full state sync rather than trying to replay hundreds of operations.
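A sketch of the server's side of that reconnection decision, assuming the client reports how many operations it has seen. The function name, the message shapes, and the `max_replay` threshold of 200 are hypothetical; the text above only says the fallback kicks in when the gap runs to hundreds of operations.

```python
def build_reconnect_response(history, current_state, last_acked_version, max_replay=200):
    """Decide between replaying missed ops and sending the whole document."""
    known = 0 <= last_acked_version <= len(history)
    missed = history[last_acked_version:] if known else []
    if not known or len(missed) > max_replay:
        # Gap too large (or version unrecognized): ship the full state instead.
        return {"type": "full_sync", "version": len(history), "state": current_state}
    return {"type": "replay", "version": len(history), "ops": missed}


history = [f"op{i}" for i in range(10)]  # stand-ins for real operations

short_gap = build_reconnect_response(history, "doc text", last_acked_version=7)
long_gap = build_reconnect_response(history, "doc text", last_acked_version=2, max_replay=5)
unknown = build_reconnect_response(history, "doc text", last_acked_version=42)
```

On the client, any local operations that were never acknowledged still have to be transformed against the replayed ones before they are resent; that step is outside this sketch.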

Operation batching was the second challenge. A fast typist generates operations at 5–10 per second. Broadcasting each one individually to every connected client creates a lot of network chatter. We batch operations in 50-millisecond windows — all operations generated within a 50ms window are combined into a single compound operation and sent as one message. This reduces network overhead by roughly 70% during fast typing without introducing perceptible latency (50ms is below the threshold where humans notice delay in a text editor).
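The windowing logic can be sketched as a pure function over timestamped operations. This assumes each window opens at the first operation that does not fit in the previous one; the function name and the sample timestamps are made up for illustration.

```python
def batch_operations(timed_ops, window_ms=50):
    """Group (timestamp_ms, op) pairs, sorted by time, into per-window batches."""
    batches = []
    window_start = None
    for timestamp, op in timed_ops:
        if window_start is None or timestamp - window_start >= window_ms:
            batches.append([])        # open a new window at this operation
            window_start = timestamp
        batches[-1].append(op)
    return batches


timed_ops = [(0, "a"), (10, "b"), (40, "c"), (60, "d"), (100, "e"), (120, "f"), (200, "g")]
batches = batch_operations(timed_ops)
# Seven operations become four messages:
# [["a", "b", "c"], ["d", "e"], ["f"], ["g"]]
```

In a live client the same grouping would be driven by a timer that flushes the current batch when its window closes, rather than by inspecting timestamps after the fact.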

Cursor Presence and Awareness

Real-time collaboration isn't just about the document — it's about knowing that other people are there. Cursor presence (seeing other users' cursors and selections in the document) is the feature that makes collaboration feel real rather than abstract.

Implementing cursor presence is architecturally separate from document operations. Cursor positions are ephemeral — they don't change the document, they don't need to be persisted, and they don't need the consistency guarantees of OT. We treat cursor updates as a separate channel of communication over the same WebSocket connection.

When a user moves their cursor or changes their selection, the client sends a presence update containing the user's identifier, their display name, a color assignment (more on this in a moment), and their cursor position expressed as a document path (paragraph index, text offset). These presence updates are broadcast to all other connected clients, which render colored cursors and selection highlights in the editor.

Color assignment sounds trivial but has surprising edge cases. You need enough distinct colors that five simultaneous collaborators are easily distinguishable, and each color has to work both as a cursor color and as a selection highlight. As a cursor it needs sufficient contrast against the white page background; as a highlight it needs to be light enough that black text remains readable on top of it. We settled on a palette of eight carefully chosen hues, assigned round-robin based on connection order.
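As a rough sketch, a presence message and round-robin color assignment might look like this. The field names, the `ColorAssigner` class, and the hex values are all placeholders; the actual palette and wire format are not given here.

```python
from dataclasses import dataclass, asdict

# Eight placeholder hues; the real palette is not specified in this post.
PALETTE = ["#D32F2F", "#7B1FA2", "#303F9F", "#0288D1",
           "#00796B", "#689F38", "#F57C00", "#5D4037"]


class ColorAssigner:
    """Hands out palette colors in connection order, wrapping after the last one."""

    def __init__(self, palette):
        self._palette = palette
        self._by_user = {}

    def color_for(self, user_id):
        if user_id not in self._by_user:
            index = len(self._by_user) % len(self._palette)
            self._by_user[user_id] = self._palette[index]
        return self._by_user[user_id]


@dataclass(frozen=True)
class PresenceUpdate:
    user_id: str
    display_name: str
    color: str
    paragraph_index: int  # document path, part 1
    text_offset: int      # document path, part 2


assigner = ColorAssigner(PALETTE)
colors = [assigner.color_for(f"user-{i}") for i in range(9)]

update = PresenceUpdate("user-0", "Ana", assigner.color_for("user-0"), 3, 47)
payload = asdict(update)  # ready to serialize and broadcast
```

With eight colors, the ninth collaborator wraps around and shares a color with the first, which is one of the edge cases a round-robin scheme has to tolerate.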

The update frequency for cursor presence is throttled to 100ms — we don't need to send a cursor update for every keystroke. The cursor jumps to its new position on the remote client, which is visually acceptable because cursor movement isn't as sensitive to smoothness as, say, a drag operation.
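A throttle along these lines can be sketched with an explicit clock so it is easy to test. `PresenceThrottle` and its two methods are hypothetical names; the one design choice worth noting is that the most recent position is kept and sent once the interval has passed, so the remote cursor never gets stuck on a stale location.

```python
class PresenceThrottle:
    def __init__(self, interval_ms=100):
        self.interval_ms = interval_ms
        self._last_sent_at = None
        self._pending = None

    def offer(self, now_ms, cursor):
        """Record the latest cursor; return it if it may be sent now, else None."""
        self._pending = cursor
        return self.flush(now_ms)

    def flush(self, now_ms):
        """Return the newest unsent cursor once the interval has elapsed."""
        if self._pending is None:
            return None
        too_soon = (self._last_sent_at is not None
                    and now_ms - self._last_sent_at < self.interval_ms)
        if too_soon:
            return None
        cursor, self._pending = self._pending, None
        self._last_sent_at = now_ms
        return cursor


throttle = PresenceThrottle()
sent = [
    throttle.offer(0, (3, 10)),   # first update goes out immediately
    throttle.offer(30, (3, 11)),  # suppressed
    throttle.offer(60, (3, 12)),  # suppressed, replaces the pending position
    throttle.flush(100),          # interval elapsed: newest position goes out
    throttle.flush(150),          # nothing pending
]
```

Here `sent` ends up as `[(3, 10), None, None, (3, 12), None]`: the intermediate position is dropped, and the final one still arrives.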

Version History: The Collaboration Safety Net

Version history predated our real-time collaboration feature, but collaboration changed how it needed to work. In a single-user editing flow, version history is straightforward: snapshot the document state periodically, let the user browse snapshots and restore any previous version. With multiple simultaneous editors, version history becomes more nuanced.

The core question is: what constitutes a "version"? Snapshotting every operation is too granular: a document with 10,000 operations doesn't need 10,000 versions in the history. Snapshotting on a fixed time interval (every 5 minutes) misses important moments. A user might make a critical change at minute 5:01, and their collaborator might overwrite it at 9:58. The snapshot at 5:00 predates the change and the snapshot at 10:00 contains only the overwrite, so the critical change never appears in the history.

We use a hybrid approach. The system creates automatic snapshots based on activity heuristics: after a burst of editing activity followed by a pause (suggesting a natural "save point"), after a user disconnects (capturing the state when someone stops contributing), and at regular intervals as a fallback. Users can also create manual snapshots — named versions that serve as explicit bookmarks in the document's history.
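The three automatic triggers could be combined in a small policy object like the one below. The thresholds (a 30-second pause, a 20-operation burst, a 300-second fallback) and all names are assumptions chosen for the example, not the values used in production.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnapshotPolicy:
    pause_s: float = 30.0               # quiet period that ends a burst (assumed)
    min_burst_ops: int = 20             # edits that count as a burst (assumed)
    fallback_interval_s: float = 300.0  # regular-interval fallback (assumed)

    def reason(self, now, last_snapshot_at, last_edit_at,
               ops_since_snapshot, user_disconnected=False) -> Optional[str]:
        """Return why a snapshot should be taken now, or None if it should not."""
        if ops_since_snapshot == 0:
            return None  # nothing changed since the last snapshot
        if user_disconnected:
            return "disconnect"
        if (ops_since_snapshot >= self.min_burst_ops
                and now - last_edit_at >= self.pause_s):
            return "pause_after_burst"
        if now - last_snapshot_at >= self.fallback_interval_s:
            return "interval"
        return None


policy = SnapshotPolicy()
after_burst = policy.reason(now=100, last_snapshot_at=0, last_edit_at=60, ops_since_snapshot=25)
still_typing = policy.reason(now=100, last_snapshot_at=0, last_edit_at=95, ops_since_snapshot=25)
fallback = policy.reason(now=400, last_snapshot_at=0, last_edit_at=395, ops_since_snapshot=3)
on_leave = policy.reason(now=10, last_snapshot_at=0, last_edit_at=9, ops_since_snapshot=1,
                         user_disconnected=True)
```

Manual, named snapshots would bypass the policy entirely and simply be recorded when the user asks for them.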

Each snapshot stores the full document state, not a diff. This costs more storage but makes restoring a version instant — we don't need to replay a chain of diffs from some base state, which is both slow and fragile if any link in the chain is corrupted. For a document editor where individual documents are typically under 1MB, the storage cost of full snapshots is negligible.

Restoring a version in a collaborative context requires care. If User A restores the document to a version from two hours ago while User B is actively editing, User B's in-progress work could be lost. We handle this by treating a version restore as a regular document operation that flows through the OT system — it's applied and broadcast like any other change. But we also show a notification to all connected users ("User A restored the document to version 'Final Draft'") and auto-create a snapshot of the pre-restore state so that nothing is irreversibly lost. The pre-restore snapshot is labeled clearly in the version history, giving any user a one-click path to undo the restore if needed.
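The safety net around a restore can be sketched as below, with full-state snapshots kept in a list. `VersionedDocument` is a hypothetical stand-in: it swaps the text directly, whereas the real restore travels through the OT pipeline as an operation.

```python
class VersionedDocument:
    def __init__(self, text):
        self.text = text
        self.snapshots = []  # (label, full document text) pairs

    def snapshot(self, label):
        self.snapshots.append((label, self.text))
        return len(self.snapshots) - 1  # index of the new snapshot

    def restore(self, index, restored_by):
        label, old_text = self.snapshots[index]
        # Capture the current state first so the restore itself can be undone.
        pre_restore = self.snapshot(f"Before restore to '{label}' by {restored_by}")
        self.text = old_text
        notification = f"{restored_by} restored the document to version '{label}'"
        return pre_restore, notification


doc = VersionedDocument("draft one")
final_draft = doc.snapshot("Final Draft")
doc.text = "draft one plus edits by B"

pre_restore, notification = doc.restore(final_draft, "User A")
text_after_restore = doc.text

doc.restore(pre_restore, "User B")  # one-step undo of the restore
text_after_undo = doc.text
```

Because every snapshot holds the full state, undoing the restore is just another restore, this time to the automatically created pre-restore snapshot.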

What We Got Wrong

No engineering retrospective is complete without the mistakes, and we made several.

Our first implementation of conflict resolution had a subtle bug with paragraph splitting. When two users simultaneously press Enter at different positions in the same paragraph, the paragraph needs to split into two (or three) paragraphs with the correct text distribution. Our initial OT transform function handled the position arithmetic correctly but failed to account for the paragraph's formatting metadata — one of the resulting paragraphs would inherit the original's heading level while the other would silently revert to body text. This bug survived testing because our test suite was focused on text content, not formatting attributes. We caught it two weeks after launch when a user reported that their headings kept "disappearing" during collaborative editing.

We also underestimated the cost of presence updates at scale. Our initial implementation broadcast cursor positions to all connected clients, regardless of whether the cursor was visible in their viewport. For documents with 8–10 simultaneous editors, this generated a meaningful amount of unnecessary network traffic and DOM updates (rendering off-screen cursors that no one can see). We added viewport-aware presence filtering — the client reports its visible document range, and the server only sends presence updates for cursors that fall within or near that range.
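Server-side, viewport-aware filtering reduces to a range check per connected client. In this sketch the viewports are paragraph ranges and "near" is a margin of five paragraphs; both the representation and the margin are assumptions for illustration.

```python
def recipients_for_cursor(cursor_paragraph, viewports, sender, margin=5):
    """Return the clients whose visible range (plus a margin) contains the cursor."""
    return sorted(
        client
        for client, (first, last) in viewports.items()
        if client != sender and first - margin <= cursor_paragraph <= last + margin
    )


# Each client reports the first and last paragraph currently on screen.
viewports = {"ana": (0, 20), "ben": (40, 60), "cy": (22, 35)}

to_notify = recipients_for_cursor(23, viewports, sender="dee")
from_cy = recipients_for_cursor(23, viewports, sender="cy")
```

A cursor in paragraph 23 is sent to "ana" (just below her viewport) and "cy", but not to "ben", who is looking at a later part of the document.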

The third mistake was more philosophical than technical. We initially built collaboration as an always-on feature — if you could edit a document, you could see everyone else editing in real-time. Feedback quickly told us that some users wanted to work on a draft privately before making their changes visible. We added the concept of editing sessions with a visibility toggle — you can work on a document with collaboration visible (see others' cursors, share your cursor) or in a focused mode where your presence isn't broadcast and you don't see others' cursors, though the underlying OT sync still operates to prevent conflicts.

Performance Under Load

Collaborative editing has performance characteristics that are fundamentally different from typical web application workloads. The collaboration service is stateful (each document is an in-memory state machine), latency-sensitive (operations need sub-100ms round-trip), and memory-bound (each active document consumes memory proportional to its size and the number of buffered operations).

We profile along three axes: operation latency (time from keystroke to appearance on a collaborator's screen), memory per active document, and maximum concurrent editors per document.

Operation latency in production averages 45–80ms for users on the same continent as our servers, which is well within the threshold where collaboration feels instantaneous. Cross-continent latency adds 100–150ms of network round-trip, which is noticeable but acceptable — the local-first application of operations means the editing user never feels lag; only the remote reflection has the additional latency.

Memory per active document is typically 2–5MB, which includes the document state, the operation buffer (last 1,000 operations for transform purposes), and the connection state for each editor. A single collaboration server process can comfortably handle 500–1,000 active documents concurrently with standard cloud instance sizing.
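As a back-of-the-envelope check on those figures, multiplying the two ranges gives the document memory a single process would hold. The helper below is purely illustrative and ignores runtime and per-process overhead.

```python
def memory_range_gb(docs_low, docs_high, mb_per_doc_low, mb_per_doc_high):
    """Best-case and worst-case document memory for one process, in GB."""
    low = docs_low * mb_per_doc_low / 1000
    high = docs_high * mb_per_doc_high / 1000
    return low, high


low_gb, high_gb = memory_range_gb(500, 1000, 2, 5)
```

That works out to roughly 1 GB at the light end (500 documents at 2MB) and 5 GB at the heavy end (1,000 documents at 5MB).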

We've tested with up to 25 concurrent editors on a single document, which works but starts to generate enough operation traffic that batching and transform computation become non-trivial. In practice, our users rarely exceed 5–8 simultaneous editors, and the system performs well within that range.

What We'd Do Differently

If we were starting from scratch today, two things would change.

First, we'd invest in better operational testing infrastructure earlier. Our initial test suite covered individual transform functions exhaustively but didn't simulate realistic multi-user editing sessions at scale. The paragraph-splitting bug and others like it only surfaced under patterns that are hard to anticipate in unit tests — two users making structurally similar edits to the same region of a document at nearly the same time. We've since built a fuzzing harness that generates random concurrent editing sessions and checks for consistency violations, and it catches classes of bugs that deterministic tests miss entirely.
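The core of such a harness fits in a few lines. This sketch fuzzes pairs of concurrent single-character operations on a flat string and checks that both application orders converge; the `Op` type, the tie-break by site number, and the `fuzz` function are assumptions for illustration, and a real harness would generate whole multi-user sessions over the rich-text model.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Op:
    kind: str     # "ins" or "del"
    pos: int
    ch: str = ""
    site: int = 0


def apply_op(doc, op):
    if op is None:  # the op was cancelled by the transform
        return doc
    if op.kind == "ins":
        return doc[:op.pos] + op.ch + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op, other):
    """Adjust `op` so it can be applied after the concurrent `other`."""
    if other.kind == "ins":
        if op.kind == "ins":
            # Same-position inserts: the lower site number goes first.
            shift = other.pos < op.pos or (other.pos == op.pos and other.site < op.site)
        else:
            shift = other.pos <= op.pos
        return replace(op, pos=op.pos + 1) if shift else op
    if other.pos < op.pos:
        return replace(op, pos=op.pos - 1)
    if other.pos == op.pos and op.kind == "del":
        return None  # both sites deleted the same character
    return op


def random_op(rng, doc, site):
    if doc and rng.random() < 0.5:
        return Op("del", rng.randrange(len(doc)), site=site)
    return Op("ins", rng.randint(0, len(doc)), rng.choice("xyz"), site)


def fuzz(rounds, seed, transform_fn=transform):
    """Return every (doc, a, b) case where the two application orders diverge."""
    rng = random.Random(seed)
    failures = []
    for _ in range(rounds):
        doc = "".join(rng.choice("abc") for _ in range(rng.randint(0, 6)))
        a, b = random_op(rng, doc, 1), random_op(rng, doc, 2)
        via_a = apply_op(apply_op(doc, a), transform_fn(b, a))
        via_b = apply_op(apply_op(doc, b), transform_fn(a, b))
        if via_a != via_b:
            failures.append((doc, a, b))
    return failures


clean = fuzz(2000, seed=1)
# A "transform" that changes nothing should be caught by the same harness.
broken = fuzz(2000, seed=1, transform_fn=lambda op, other: op)
```

The value of the approach is in the second call: swapping in a faulty transform produces concrete failing cases (document, operation A, operation B) that can be turned directly into regression tests.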

Second, we'd design the collaboration protocol to be editor-agnostic from day one. Our initial implementation was tightly coupled to our specific editor's internal document model, which made it harder to adapt when we later refactored the editor for better performance. A cleaner separation between the collaboration layer (operations, transforms, transport) and the editor layer (document model, rendering, user input) would have saved us several weeks of migration work. This is advice that sounds obvious in retrospect but is hard to follow when you're building fast and the editor and collaboration system are evolving simultaneously.

The Result

Today, real-time collaboration in Quixli works the way you'd expect: open a document, share the link with your teammate, and both of you can type, format, embed media, and restructure the document simultaneously. Cursors show where everyone is working. Changes appear in real-time. Version history captures the evolution of the document so you can always roll back.

But collaboration is only half the story. The feature that makes Quixli different isn't that multiple people can edit a document together — most modern document tools can do that. The difference is what happens after the collaborative editing is done. You take the finished document and share it externally with PIN protection, an expiration date, and analytics that tell you whether the recipient actually read it. The collaboration is the means. The shareable, secure, trackable document is the end.

If you're building a document tool and wrestling with the OT-vs-CRDT decision, here's the honest summary: if your architecture already has a server and your editor supports rich text, OT is the pragmatic choice. If you need offline-first peer-to-peer editing on simple data types, CRDTs are elegant and proven. Neither is universally better — they solve the same problem with different tradeoffs for different architectural contexts.

And if you just want to write a document with your team and then share it with someone who matters, you don't need to care about any of this. You just need Quixli.
