Fig. Users train their own machine learning models and upload them to the server for aggregation.
Fig. Pipelines for one training iteration in conventional training and PiPar when using a split DNN. “Comp” is an abbreviation for computation. 𝑓, 𝑏, 𝑢 and 𝑑 represent forward pass, backward pass, upload and download, respectively. Superscripts indicate server-side (𝑠) or client-side (𝑐) computation or communication.
Author: Zihan Zhang
Many modern intelligent systems—such as virtual assistants, smart keyboards, health monitoring apps, and adaptive interfaces—are built around machine learning models that learn from user behavior. These systems increasingly rely on learning from real people, in real contexts, to provide effective, personalized experiences. However, using user data raises an immediate question: how can we design learning systems that are not only intelligent but also respectful of user privacy and control?
Traditionally, user data is sent to cloud servers to train models. This centralized approach, while powerful, often violates user expectations around privacy and autonomy. As a response, collaborative machine learning (CML) has emerged. CML shifts the training process from the cloud to the edge, allowing user data to stay on personal devices while still contributing to shared learning goals. In this paradigm, people are not just passive data sources—they are active participants in a distributed learning system.
But this shift introduces a new challenge: inefficiency. Most existing CML methods require devices and servers to wait on each other, which creates significant idle time and slows down training. In applications where user needs evolve quickly, such as real-time language correction or adaptive accessibility tools, this delay becomes a serious limitation. To fully support human-centered AI systems, we need collaborative learning approaches that are not only privacy-preserving, but also efficient, responsive, and scalable.
Background:
CML refers to training machine learning models across multiple devices and a central server without sharing raw data. Here’s how some well-known techniques work:
Federated Learning (FL): Each device trains a full model on its local data and periodically sends updates to the server, which aggregates them. The server is idle while waiting for device updates.
Split Learning (SL): The model is split between the device and the server. Each device runs the first few layers and sends intermediate activations to the server, which completes the forward and backward passes and returns gradients for the split point. Devices operate one at a time, leading to even more idle time (a minimal sketch of this sequential interaction follows this list).
Split Federated Learning (SFL): A hybrid approach where devices work in parallel but still experience delays due to sequential server-device interactions.
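To make the sequential structure concrete, the sketch below walks through a single split-learning iteration. It is only an illustration under assumptions: the two-part model, the layer sizes, and the in-process hand-off standing in for network transfers are invented here for clarity, not taken from any particular SL implementation.

```python
# A minimal sketch of one split-learning iteration (illustrative only).
# The model split, layer sizes, and in-process "upload"/"download" steps are
# assumptions; a real deployment exchanges tensors over a network.
import torch
import torch.nn as nn

client_layers = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # early layers on the device
server_layers = nn.Sequential(nn.Linear(64, 10))             # remaining layers on the server
client_opt = torch.optim.SGD(client_layers.parameters(), lr=0.01)
server_opt = torch.optim.SGD(server_layers.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_one_batch(x, y):
    # 1. Client forward pass through the early layers.
    activations = client_layers(x)

    # 2. "Upload" the intermediate activations; the client now idles.
    sent = activations.detach().requires_grad_(True)

    # 3. Server forward and backward pass through the remaining layers.
    server_opt.zero_grad()
    loss = loss_fn(server_layers(sent), y)
    loss.backward()
    server_opt.step()

    # 4. "Download" the gradient at the split point; the server now idles.
    grad_from_server = sent.grad

    # 5. Client backward pass and parameter update.
    client_opt.zero_grad()
    activations.backward(grad_from_server)
    client_opt.step()
    return loss.item()

x = torch.randn(8, 32)               # one toy mini-batch
y = torch.randint(0, 10, (8,))
print(train_one_batch(x, y))
```

Every numbered step waits for the previous one, so at any moment either the device or the server (or both, during transfers) is idle.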
This inefficient structure stands in the way of real-time learning systems that must adapt quickly to user behavior.
Problem:
All three methods are fundamentally limited by sequential computation and communication, resulting in inefficient use of both computing power and network bandwidth.
To make CML more effective in practice, especially for user-facing applications, it is crucial to reduce the idle time on both devices and servers. The goal is to allow these components to work simultaneously, rather than waiting for one another. Additionally, communication (such as sending data across a network) often blocks computation, meaning both sides stop working during transmission.
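A toy timing model makes the cost of this waiting concrete. The stage durations below are invented numbers, not measurements from the PiPar paper; the point is only that a strictly sequential iteration pays for every stage in turn, whereas an idealized pipeline, once full, is limited by its slowest stage.

```python
# Toy per-iteration timing model with made-up stage durations (milliseconds).
stages = {
    "client forward": 20,
    "upload activations": 15,
    "server forward + backward": 25,
    "download gradients": 15,
    "client backward": 20,
}

sequential_time = sum(stages.values())  # every stage waits for the previous one
pipelined_time = max(stages.values())   # idealized steady state: bounded by the slowest stage

print(f"sequential iteration: {sequential_time} ms")                  # 95 ms
print(f"pipelined steady state: ~{pipelined_time} ms per mini-batch")  # ~25 ms
```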
By addressing these inefficiencies, it’s possible to accelerate model training significantly while maintaining privacy.
Pipeline parallelism:
We introduce a new method, PiPar [1], which organises the training process in collaborative machine learning by treating it like an assembly line. In traditional approaches, training steps are performed one at a time: first a device runs a forward pass to compute an intermediate result, then it sends that result to the server, then the server processes it, and so on. In PiPar, instead of processing one mini-batch of data at a time in sequence, the system handles multiple mini-batches at once in a staggered fashion.
Here’s what that looks like in practice. Each device processes a small batch of data and performs a forward pass through the early layers of the model. As soon as it finishes that, it sends the intermediate result to the server and immediately starts working on the next mini-batch. The server, in turn, begins processing the first mini-batch as soon as it arrives, while more batches are on their way from the device. When the server finishes computing gradients for a mini-batch, it sends the results back, and the device can then complete the backward pass whenever it’s ready. Since multiple batches are in motion at once, and computation and communication are happening at the same time, neither side has to sit idle waiting for the other.
This setup allows the devices to work continuously, performing forward passes on new data while waiting for gradient information from earlier batches. Similarly, the server stays busy processing incoming mini-batches from devices instead of waiting for a full round to complete. The training process becomes a pipeline where different stages are always active, just like how different workers on an assembly line handle different parts of a product. By overlapping device-side computation, server-side computation, and the communication between them, PiPar turns downtime into productive work and drastically reduces overall training time.
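One way to picture this in code is the thread-and-queue sketch below. It is not the PiPar implementation: the split model, the queues standing in for the network, the toy data, and the single deferred client update are simplifying assumptions used only to show several mini-batches in flight at once.

```python
# A sketch of pipelined split training with threads and queues (illustrative only).
# The two-part model, the queue-based "network", and the toy data are assumptions;
# the real PiPar system overlaps device computation, server computation and network
# transfers, and schedules parameter updates more carefully than this sketch.
import queue
import threading
import torch
import torch.nn as nn

client_layers = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # early layers on the device
server_layers = nn.Sequential(nn.Linear(64, 10))             # remaining layers on the server
client_opt = torch.optim.SGD(client_layers.parameters(), lr=0.01)
server_opt = torch.optim.SGD(server_layers.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

upload_q = queue.Queue()    # client -> server: (batch_id, activations, labels)
download_q = queue.Queue()  # server -> client: (batch_id, gradient at the split point)
NUM_BATCHES = 8

def server_worker():
    # The server starts each mini-batch as soon as its activations arrive,
    # while the client is already producing activations for later batches.
    for _ in range(NUM_BATCHES):
        batch_id, sent, y = upload_q.get()
        server_opt.zero_grad()
        loss = loss_fn(server_layers(sent), y)
        loss.backward()
        server_opt.step()
        download_q.put((batch_id, sent.grad))

def client_worker():
    pending = {}  # batch_id -> client-side activations awaiting their gradients
    client_opt.zero_grad()
    for batch_id in range(NUM_BATCHES):
        x = torch.randn(8, 32)          # toy mini-batch
        y = torch.randint(0, 10, (8,))
        activations = client_layers(x)  # forward pass for this batch
        pending[batch_id] = activations
        upload_q.put((batch_id, activations.detach().requires_grad_(True), y))
        # Run backward passes for any gradients that have already come back,
        # without blocking the next forward pass.
        while not download_q.empty():
            done_id, grad = download_q.get()
            pending.pop(done_id).backward(grad)
    # Drain the remaining gradients once every forward pass has been sent.
    while pending:
        done_id, grad = download_q.get()
        pending.pop(done_id).backward(grad)
    client_opt.step()  # simplification: one client update per pipeline round

server = threading.Thread(target=server_worker)
client = threading.Thread(target=client_worker)
server.start(); client.start()
client.join(); server.join()
print("pipelined round finished")
```

Running the sketch, the server begins work on the first mini-batch while the client is already producing the second, which is exactly the overlap of computation and communication described above.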
Conclusion:
Although PiPar is a system-level technique, its impact is directly relevant to human-centered computing:
Faster Personalization: Applications like smart keyboards, activity trackers, or assistive technologies can be trained more frequently and responsively.
Respect for Privacy: Data never leaves the user’s device, aligning with growing expectations for ethical AI.
Energy and Resource Efficiency: Mobile devices, often constrained by battery and compute power, benefit from offloading part of the workload efficiently.
PiPar transforms collaborative machine learning from a step-by-step process into a continuous, efficient flow, enabling systems to learn faster while preserving user privacy. This has profound implications for human–computer interaction. Faster training means more immediate personalization—apps and devices can adapt to individual users in real time. Because PiPar keeps user data local, it supports ethical, privacy-respecting design. It also helps ensure that on-device learning won’t drain the user’s battery or rely on perfect network conditions.
As more intelligent systems embed learning in everyday interactions, people are no longer just beneficiaries—they are participants in the learning loop. PiPar offers a practical path toward realizing human-centered, privacy-preserving, and adaptive AI systems that are ready for the realities of personal computing.
Reference:
[1] Zihan Zhang, Philip Rodgers, Peter Kilpatrick, Ivor Spence, Blesson Varghese, PiPar: Pipeline parallelism for collaborative machine learning, Journal of Parallel and Distributed Computing, Volume 193, 2024