When to Use Actigraphy and When to Use Machine Learning for Sleep Analysis

Classic actigraphy algorithms and machine learning models often read the exact same input: the raw motion stream from a wrist-worn device, and increasingly the raw photoplethysmography (PPG) and respiration signals alongside it. The interesting question is not which family of methods is better in the abstract. It is which one fits the study question you are trying to answer, the labeled data you actually have, and the constraints you have to operate under. This post lays out a practical way to decide.

Two ways to read the same signal

Classic actigraphy on raw signals

For three decades, automated sleep scoring meant feeding activity counts into a fixed, hand-tuned formula. The Cole-Kripke algorithm scores each epoch from a weighted sum of activity in surrounding epochs and reaches roughly 88 percent sleep versus wake agreement against polysomnography (PSG)¹. The Sadeh algorithm uses a different feature set and reported 91 to 93 percent agreement in its validation samples². Both are deterministic, transparent, and still in everyday use.
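
To make the weighted-sum idea concrete, here is a minimal sketch of that style of scorer. It is not the published Cole-Kripke formula: the weights, scale factor, threshold, and the function name score_epochs are illustrative placeholders chosen for this example, and any real use should take the coefficients from the original paper for the matching epoch length and device.

```python
from typing import List, Sequence

# Illustrative values only: these are NOT the published Cole-Kripke coefficients.
# The window covers 4 epochs before the current one, the current one, and 2 after.
ILLUSTRATIVE_WEIGHTS = (0.04, 0.06, 0.03, 0.04, 0.14, 0.05, 0.04)
EPOCHS_BEFORE = 4
SCALE = 0.01      # hypothetical scaling factor
THRESHOLD = 1.0   # a scaled score below this is labeled sleep


def score_epochs(
    counts: Sequence[float],
    weights: Sequence[float] = ILLUSTRATIVE_WEIGHTS,
    epochs_before: int = EPOCHS_BEFORE,
    scale: float = SCALE,
    threshold: float = THRESHOLD,
) -> List[str]:
    """Label each epoch from a weighted sum of activity in nearby epochs."""
    n = len(counts)
    labels = []
    for i in range(n):
        total = 0.0
        for k, weight in enumerate(weights):
            j = i - epochs_before + k
            if 0 <= j < n:  # epochs outside the recording contribute nothing
                total += weight * counts[j]
        labels.append("sleep" if scale * total < threshold else "wake")
    return labels


if __name__ == "__main__":
    night = [0, 0, 0, 0, 0, 1000, 0, 0, 0, 0, 0, 0]
    print(score_epochs(night))
```

Everything the scorer does is visible in a few lines, which is the transparency argument in miniature: there are no learned parameters, and the same counts always yield the same labels.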

The shift to raw-data accelerometry kept the heuristic spirit but dropped the proprietary activity count. Count-free methods such as the van Hees HDCZA algorithm detect the sleep period window from the variance in estimated arm angle, with no training labels required, and reach a c-statistic near 95 percent against sleep diaries in a large cohort³. The appeal of this whole family is consistent: the logic is inspectable, the same code behaves predictably across device brands and populations, compute is trivial, and the validation literature is deep. The hard limit is equally clear. Motion alone cannot reliably separate REM, light, and deep sleep. These methods are built for sleep versus wake and for sleep period estimates such as total sleep time, sleep onset, and wake after sleep onset.
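
A simplified sketch of the angle-variability idea is shown below. It loosely follows the general shape of HDCZA-style detection (smooth the epoch-to-epoch change in arm angle, derive a threshold from the recording itself, keep the longest sustained still block), but it is not the reference implementation. The function names, the 5-second epoch assumption, and the default parameters are assumptions made for this example and should be checked against the published method before use.

```python
import math
from statistics import median


def z_angle(ax, ay, az):
    """Arm angle relative to the horizontal plane, in degrees, for one sample."""
    return math.degrees(math.atan2(az, math.hypot(ax, ay)))


def rolling_median(values, window):
    """Centered rolling median; windows are truncated at the edges."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def sleep_period_window(angles, smooth=60, factor=15.0,
                        min_block=360, max_gap=720):
    """Return (start, end) epoch indices of the longest still block, or None.

    angles: one arm-angle value per epoch (assumed 5 s epochs, so the
    defaults correspond to 5 min smoothing, 30 min blocks, 60 min gaps).
    """
    if len(angles) < 2:
        return None
    change = [abs(b - a) for a, b in zip(angles, angles[1:])]
    smoothed = rolling_median(change, smooth)
    ranked = sorted(smoothed)
    # Data-driven threshold: a multiple of the 10th percentile of angle change.
    threshold = ranked[int(0.1 * (len(ranked) - 1))] * factor
    still = [value < threshold for value in smoothed]

    # Collect runs of sustained stillness that are long enough.
    blocks, start = [], None
    for i, flag in enumerate(still + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_block:
                blocks.append([start, i])
            start = None

    # Merge blocks separated by short interruptions.
    merged = []
    for block in blocks:
        if merged and block[0] - merged[-1][1] < max_gap:
            merged[-1][1] = block[1]
        else:
            merged.append(block)
    if not merged:
        return None
    start, end = max(merged, key=lambda b: b[1] - b[0])
    return start, end
```

Because the threshold is derived from each recording's own distribution of angle changes, no labeled data is involved at any point, which is what makes this family portable across cohorts.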

Machine learning on raw signals

Machine learning replaces the hand-tuned formula with parameters learned from labeled PSG. Random forests trained on wrist accelerometry reach an F1 score of about 74 percent for sleep versus wake on held-out subjects⁴. Notably, the authors of that work are candid that raw accelerometry does not by itself beat traditional actigraphy on accuracy, and that staging from motion alone stays difficult; raw data’s real advantage is transparency and the ability to reprocess the same recording for many purposes⁴. The accuracy gains arrive when you add signals. With PPG heart rate and a circadian clock proxy added to motion, neural networks scored 90 percent of epochs correctly for sleep versus wake and about 72 percent for the three-way split of wake, NREM, and REM⁵. The lesson is that ML earns its keep when you want multi-class staging and when you can fuse modalities.

Where staging accuracy actually comes from

Our own work at Centralive is built around that fusion principle. A compact architecture combining a convolutional neural network with a bidirectional long short-term memory network, fed PPG supplemented with respiratory input, reached 92.7 percent accuracy at 2-stage classification (Cohen’s kappa 0.768), 80.2 percent at 3-stage (kappa 0.714), 76.8 percent at 4-stage (kappa 0.550), and 76.7 percent at 5-stage (kappa 0.616), all on raw data using only a few inexpensive sensors⁶. The respiration channel is what lifts staging past what motion or PPG can deliver on their own, and the model is light enough to be practical outside the lab.

The direction we are heading points at a synthesis rather than a winner. A forthcoming hybrid model pairs a transformer with a Hidden Markov Model, blending learned representations with the structured temporal priors that classic sequence methods always relied on. Sleep stages do not transition at random, and encoding that structure on top of a learned feature extractor is where the heuristic tradition and the machine learning tradition meet.
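
The general idea of layering a temporal prior over per-epoch model outputs can be sketched with plain Viterbi decoding. This is a generic illustration, not the forthcoming Centralive model: the function name viterbi_smooth, the three-state setup, and the transition values in the usage note are assumptions made for this example, and treating classifier probabilities directly as emission scores is a common simplification rather than a full hybrid formulation.

```python
import math

EPS = 1e-12  # floor to avoid log(0)


def viterbi_smooth(probs, transitions, initial):
    """Most likely stage sequence given per-epoch class probabilities.

    probs:       list of per-epoch probability lists (one value per stage)
    transitions: transitions[a][b] = prior probability of moving a -> b
    initial:     prior probability of each stage at the first epoch
    """
    n_states = len(initial)
    log = lambda p: math.log(max(p, EPS))

    score = [log(initial[s]) + log(probs[0][s]) for s in range(n_states)]
    back = []
    for obs in probs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Best predecessor for stage s under the transition prior.
            best = max(range(n_states),
                       key=lambda p: prev[p] + log(transitions[p][s]))
            pointers.append(best)
            score.append(prev[best] + log(transitions[best][s]) + log(obs[s]))
        back.append(pointers)

    # Trace back from the best final stage.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With stages indexed as 0 = wake, 1 = NREM, 2 = REM, a "sticky" transition matrix (0.9 on the diagonal, 0.05 elsewhere), and a uniform initial prior, an isolated one-epoch flip such as per-epoch argmax labels 0, 0, 1, 0, 0 with a weak 0.5 versus 0.4 margin on the middle epoch is smoothed back to wake, while a sustained, confident switch between stages is preserved.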

A decision framework

Reach for classic actigraphy heuristics when:

  • Your outcome is sleep versus wake or a sleep period metric (total sleep time, sleep onset, wake after sleep onset), not sleep stage.
  • You run large or multi-cohort studies and need one method to behave consistently across populations and device brands.
  • You have little or no labeled PSG for your target population.
  • Transparency and reproducibility carry weight, whether for regulatory review, clinical audit, or open science.
  • You need on-device, low-power scoring with a small footprint.

Reach for machine learning when:

  • You need sleep staging across light, deep, and REM, not just sleep versus wake.
  • You can fuse modalities such as PPG, heart rate variability, respiration, and accelerometry.
  • You have labeled PSG that covers your target population and your hardware.
  • Accuracy on a specific population matters more than portability across cohorts.
  • You can commit to honest epoch-by-epoch validation against PSG.

Validate the same way regardless of method

Whichever family you pick, the evaluation discipline is identical. Report Cohen’s kappa and epoch-by-epoch agreement against PSG rather than headline accuracy alone, because high overall accuracy can hide a model that collapses on the wake class, a known weakness of count-based scoring. Test on held-out subjects, not just held-out epochs from subjects the model has already seen, since the latter badly overstates real-world performance. And watch for distribution shift when you move to a new device, a new wear location, or a new clinical population, because that is where both heuristics and learned models quietly degrade.
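
Both habits are easy to sketch with the standard library. The helper names cohens_kappa and split_by_subject below are hypothetical and written for this post; in practice, scikit-learn's cohen_kappa_score and GroupKFold cover the same ground.

```python
import math
import random
from collections import Counter


def cohens_kappa(truth, predicted):
    """Chance-corrected agreement between two label sequences."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_freq, pred_freq = Counter(truth), Counter(predicted)
    expected = sum(truth_freq[c] * pred_freq[c] for c in truth_freq) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined when both raters use a single class
    return (observed - expected) / (1.0 - expected)


def split_by_subject(subject_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no subject appears in both."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(test_fraction * len(subjects)))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


if __name__ == "__main__":
    # A scorer that never predicts wake on a night that is 90 percent sleep.
    truth = ["sleep"] * 90 + ["wake"] * 10
    predicted = ["sleep"] * 100
    accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
    print(accuracy, cohens_kappa(truth, predicted))  # 0.9 0.0
```

The example at the bottom is the failure mode described above: a scorer that labels every epoch as sleep gets 90 percent accuracy on that night, yet its kappa is 0 because it agrees with PSG no more than chance would.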

Raw access is the precondition for both

Neither approach works on vendor-summarized stages. A heuristic needs the raw motion or angle stream, and a learned model needs the raw waveforms it was trained on. That makes platform choice a research decision, not an afterthought. Garmin stands out as a first-class research platform here: raw signal access through the Health Companion SDK, no subscription gate sitting between you and the data, and an accessible hardware price point that scales to real cohort sizes. Form factor is a separate axis from data access. Rings and watches differ in wear compliance and signal quality, but the question that determines whether either scoring family is even possible is always the same one, which is whether you can get to the raw signal.

Bottom line

Use classic actigraphy heuristics when you need robust, transparent, portable sleep versus wake and sleep period estimates at scale. Use machine learning when you need multimodal sleep staging and you have the labeled data to back it. The most promising path forward is neither one alone but the hybrid, where learned features carry the signal and structured temporal priors keep the stage sequence physiologically honest.



References

  1. Cole RJ, Kripke DF, Gruen W, Mullaney DJ, Gillin JC. Automatic sleep/wake identification from wrist activity. Sleep. 1992;15(5):461-469. doi:10.1093/sleep/15.5.461
  2. Sadeh A, Sharkey KM, Carskadon MA. Activity-based sleep-wake identification: an empirical test of methodological issues. Sleep. 1994;17(3):201-207. doi:10.1093/sleep/17.3.201
  3. van Hees VT, Sabia S, Jones SE, et al. Estimating sleep parameters using an accelerometer without sleep diary. Sci Rep. 2018;8:12975. doi:10.1038/s41598-018-31266-z
  4. Sundararajan K, Georgievska S, te Lindert BHW, et al. Sleep classification from wrist-worn accelerometer data using random forests. Sci Rep. 2021;11:24. doi:10.1038/s41598-020-79217-x
  5. Walch O, Huang Y, Forger D, Goldstein C. Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device. Sleep. 2019;42(12):zsz180. doi:10.1093/sleep/zsz180
  6. Kazemi K, Abiri A, Zhou Y, et al. Improved sleep stage predictions by deep learning of photoplethysmogram and respiration patterns. Comput Biol Med. 2024;179:108679. doi:10.1016/j.compbiomed.2024.108679