Processed vs. Raw Sleep Data: Choosing the Right Wearable Sleep Pipeline for Your Study

One of the most consequential decisions in any wearable-based sleep study is also one of the least discussed up front: are you going to work with the vendor’s processed sleep summaries, or are you going to archive and analyze raw accelerometry (and ideally raw PPG/HR) yourself? The choice shapes everything downstream — sample size, storage budget, reproducibility, what you can publish, and how grant reviewers respond to your proposal.

There’s no universally correct answer. There is, however, a defensible answer for any given study, and getting it right early saves enormous pain later. Below is a practical framework for thinking through it, with specific attention to the kinds of digital health deployments we work on at Centralive.

When to Use Processed (Vendor) Sleep Data

You need scale, not granularity

Large cohort studies, free-living deployments, and longitudinal monitoring programs that care about night-to-night and week-to-week trends in total sleep time, sleep efficiency, or rough stage proportions are well-served by vendor outputs. Fitbit, Apple, Oura, Garmin, and Withings algorithms are good enough for population-level patterns and behavioral correlations, even if they’re imperfect on any given epoch [1][2].

The outcome is behavioral or self-management focused

Just-in-time adaptive interventions (JITAIs), ecological momentary assessment (EMA) triggers, digital phenotyping pipelines, and patient-facing dashboards generally use processed summaries, both because that’s what users see in their own apps and because it aligns with the device’s intended use. A mismatch between the data layer the participant sees and the one the intervention engine acts on is a recipe for confused users and brittle logic.

You don’t have the infrastructure (or IRB scope) for raw streams

Raw accelerometry at 25–100 Hz balloons storage fast: at 16-bit resolution, a single participant’s uncompressed triaxial stream runs from hundreds of megabytes to more than a gigabyte per month, before timestamps or PPG are added. It also requires validated processing pipelines, secure transfer, and longer-term archival planning. If your IRB protocol, data management plan, or cloud budget can’t absorb that, processed summaries are the honest answer.
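
To put rough numbers on the storage claim, here is a back-of-the-envelope sketch. It assumes continuous wear, three axes, 16-bit samples, and no compression, timestamps, or file overhead; the function name and defaults are ours for illustration, and real device files will differ.

```python
def raw_accel_bytes(sample_rate_hz: float, days: float,
                    axes: int = 3, bytes_per_sample: int = 2) -> float:
    """Uncompressed payload size for continuous wear.

    Assumes fixed-rate sampling and ignores timestamps, headers,
    and any other channels (PPG, temperature, and so on).
    """
    seconds = days * 24 * 60 * 60
    return sample_rate_hz * axes * bytes_per_sample * seconds


# Per-participant estimate for a 30-day month at common sampling rates.
for rate in (25, 50, 100):
    gb = raw_accel_bytes(rate, days=30) / 1e9
    print(f"{rate:>3} Hz: {gb:.2f} GB per participant per 30 days")
```

Under these assumptions the estimate lands at roughly 0.4, 0.8, and 1.6 GB per participant-month for 25, 50, and 100 Hz. Multiply by cohort size and study length before you write the data management plan.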

You’re benchmarking against PSG at the summary level

Most of the validation literature (Chinoy et al.’s multi-device comparisons [1], de Zambotti et al.’s reviews [2], and the Menghini et al. framework for standardized evaluation [3]) evaluates vendor outputs, not raw signals, against polysomnography (PSG), both as nightly summary measures and epoch by epoch. If you want your numbers to be comparable to that body of work, you should be using what they used.
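
For readers who haven’t run this kind of comparison, the core epoch-by-epoch metrics are simple to compute once the device and PSG series are aligned. The sketch below is a minimal illustration, not the code released with [3]; it assumes binary sleep/wake coding (1 = sleep, 0 = wake) and uses made-up example data.

```python
from typing import Dict, Sequence


def epoch_agreement(psg: Sequence[int], device: Sequence[int]) -> Dict[str, float]:
    """Epoch-by-epoch agreement with PSG as the reference (1 = sleep, 0 = wake).

    sensitivity: share of PSG sleep epochs the device also scored as sleep
    specificity: share of PSG wake epochs the device also scored as wake
    """
    if len(psg) != len(device):
        raise ValueError("PSG and device series must be aligned epoch for epoch")
    pairs = list(zip(psg, device))
    tp = sum(1 for p, d in pairs if p == 1 and d == 1)
    tn = sum(1 for p, d in pairs if p == 0 and d == 0)
    fp = sum(1 for p, d in pairs if p == 0 and d == 1)
    fn = sum(1 for p, d in pairs if p == 1 and d == 0)

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Illustrative (made-up) ten-epoch night: the device misses two of three wake epochs.
psg_epochs = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]
device_epochs = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
result = epoch_agreement(psg_epochs, device_epochs)
print(result)
```

In this toy example sensitivity is 1.0 while specificity is only one third, which is why reporting accuracy alone can flatter a device that scores nearly everything as sleep.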

When to Use Raw Accelerometry (and Ideally Raw PPG/HR)

You need algorithmic transparency or reproducibility

Vendor algorithms are black boxes that change silently with firmware updates — a real and frequently documented threat to longitudinal studies [4]. Raw data lets you apply open, version-controlled algorithms (van Hees’ GGIR [5], Cole-Kripke [6], Sadeh [7], or your own models) consistently across time and devices. If a participant’s “deep sleep” number jumps 15% mid-study because Fitbit pushed a firmware update, you want to be able to detect that — or sidestep it entirely.
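
If you are stuck with processed data, a crude first check for this kind of jump is to compare a participant’s nightly values in the windows before and after a known update date. The sketch below is illustrative only: the function name, the 14-night window, and the synthetic data are assumptions, and a real analysis would need a proper statistical model and attention to seasonal or behavioral confounds.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, Optional


def relative_shift(nightly: Dict[date, float], update_day: date,
                   window_nights: int = 14) -> Optional[float]:
    """Relative change in the mean nightly value across a known update date.

    Compares up to `window_nights` nights before the update with up to
    `window_nights` nights starting on the update day itself.
    Returns None when either window is empty or the baseline mean is zero.
    """
    before = [v for d, v in nightly.items()
              if 0 < (update_day - d).days <= window_nights]
    after = [v for d, v in nightly.items()
             if 0 <= (d - update_day).days < window_nights]
    if not before or not after or mean(before) == 0:
        return None
    return (mean(after) - mean(before)) / mean(before)


# Synthetic example: deep sleep minutes step from 60 to 69 on the update date.
update = date(2024, 3, 10)
nightly = {update + timedelta(days=k): (60.0 if k < 0 else 69.0)
           for k in range(-14, 14)}
shift = relative_shift(nightly, update)
print(f"Relative shift across update: {shift:+.0%}")
```

A shift that appears on the same calendar date across many participants is a much stronger hint of an algorithm change than a shift in any single person.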

You’re developing or validating new methods

Novel sleep-stage classifiers, arousal detection, movement-based biomarkers, and anything else that needs features you can compare directly against ground truth all require raw signals. You can’t reverse-engineer a new algorithm out of a vendor’s 30-second epoch labels or nightly summaries.

You’re studying populations where vendor algorithms underperform

Older adults, people with fragmented sleep, shift workers, and clinical populations — insomnia, sleep apnea, depression — see meaningful performance degradation with vendor algorithms, which are typically trained on healthy young adults [2][8]. Raw data lets you retrain or recalibrate scoring rules for the population you actually care about.

You need cross-device harmonization

If a study uses multiple wearables, processed outputs aren’t comparable: different epoch lengths, different stage definitions, different proprietary scoring conventions. Raw accelerometry processed through a common pipeline (e.g., GGIR, or scoring built on the ENMO metric described in [9]) gives you a defensible apples-to-apples comparison across brands.
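
As a concrete picture of what a common first step can look like, here is a minimal sketch of ENMO (Euclidean Norm Minus One), the gravity-corrected magnitude metric described in [9], averaged into fixed-length epochs. It assumes samples are already calibrated and expressed in g; the 5-second epoch and the function names are our choices for illustration, and this is not GGIR’s implementation.

```python
import math
from typing import List, Sequence, Tuple


def enmo(samples: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Euclidean Norm Minus One per sample, in g, with negatives set to zero."""
    return [max(0.0, math.sqrt(x * x + y * y + z * z) - 1.0)
            for x, y, z in samples]


def epoch_means(values: Sequence[float], sample_rate_hz: int,
                epoch_s: int = 5) -> List[float]:
    """Average per-sample values into non-overlapping epochs.

    A trailing partial epoch is dropped so every epoch covers the same duration.
    """
    n = sample_rate_hz * epoch_s
    return [sum(values[i:i + n]) / n
            for i in range(0, len(values) - n + 1, n)]


# A device lying still reads about 1 g on one axis, so ENMO is about zero.
still = [(0.0, 0.0, 1.0)] * 50          # 5 s at an assumed 10 Hz
moving = [(0.0, 0.0, 2.0)] * 50         # exaggerated movement for illustration
epochs = epoch_means(enmo(still + moving), sample_rate_hz=10)
print(epochs)
```

Because the same function runs on every device’s raw stream, differences that remain afterward come from the hardware (sensor range, sampling rate, calibration) rather than from proprietary scoring, and those are differences you can measure and report.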

You care about features beyond sleep itself

Movement during sleep, posture, sedentary behavior bleeding into bedtime, micro-awakenings, restlessness as a depression or anxiety marker — these are derived features that require raw signals. Vendor summaries throw them away.

A Practical Hybrid for Centralive-Style Deployments

For most of the work we see at Centralive, the right answer is to collect both when feasible.

Store processed sleep summaries for real-time intervention logic and user-facing feedback — that’s what keeps the deployment responsive and what participants actually see in their apps. In parallel, archive raw accelerometry (plus raw HR/PPG if the device exposes it) for retrospective analysis, reproducibility, and method development.

A quick reality check on what each device platform actually exposes:

  • Apple Watch via HealthKit: processed outputs only
  • Empatica, ActiGraph, Verisense, Movisens: raw access
  • Fitbit Sense/Versa: limited raw access via the SDK
  • Garmin Health API: now exposes some raw streams

This dual-archive posture is also what grant reviewers increasingly expect for digital biomarker work. NIH’s PAR-25-170 in particular looks favorably on raw-data archival plans [10], and reviewers are getting more sophisticated about asking how you’ll guard against algorithmic drift over a multi-year study.

One Caveat Worth Flagging in Any Proposal

If you state in a grant or protocol that you’ll use processed vendor sleep, expect reviewers to ask about firmware drift and the lack of algorithmic transparency. Have an answer ready — ideally some combination of:

  • Firmware and app version logging at the participant level
  • Sensitivity analyses around known algorithm changes
  • A raw-data fallback for a subset of participants
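
The first two items fit together: once firmware versions are logged per participant, each night can be tagged with the version in effect, and sensitivity analyses can be stratified on that tag. The sketch below assumes a hypothetical log of (first date seen, version) pairs sorted by date; the version strings and function name are invented for illustration.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple


def firmware_on(night: date, log: List[Tuple[date, str]]) -> Optional[str]:
    """Return the firmware version in effect on `night`.

    `log` holds (first date the version was observed, version string)
    pairs sorted by date. Nights before the first entry return None.
    """
    first_seen = [d for d, _ in log]
    i = bisect_right(first_seen, night)
    return log[i - 1][1] if i else None


# Hypothetical per-participant version log.
version_log = [(date(2024, 1, 1), "fw-1.0"), (date(2024, 3, 10), "fw-1.1")]

for night in (date(2023, 12, 31), date(2024, 3, 9), date(2024, 3, 10)):
    print(night.isoformat(), firmware_on(night, version_log))
```

Nights that fall before the first log entry come back as None rather than being guessed, which keeps unknown-version data visibly separate in later analyses.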

The point isn’t that processed data is wrong — it often isn’t. The point is that the field has matured enough that “we’ll use the Fitbit numbers” without further qualification is no longer a complete answer.

The Bottom Line

Match the data layer to the question you’re actually trying to answer. Processed summaries are right for behavior change, population trends, and user-facing experiences. Raw signals are right for method development, clinical populations, cross-device studies, and anywhere reproducibility matters more than convenience. When in doubt — and when your infrastructure allows — capture both.

References

  1. Chinoy, E. D., Cuellar, J. A., Huwa, K. E., Jameson, J. T., Watson, C. H., Bessman, S. C., Hirsch, D. A., Cooper, A. D., Drummond, S. P. A., & Markwald, R. R. (2021). Performance of seven consumer sleep-tracking devices compared with polysomnography. Sleep, 44(5), zsaa291. https://doi.org/10.1093/sleep/zsaa291
  2. de Zambotti, M., Cellini, N., Goldstone, A., Colrain, I. M., & Baker, F. C. (2019). Wearable sleep technology in clinical and research settings. Medicine and Science in Sports and Exercise, 51(7), 1538–1557. https://doi.org/10.1249/MSS.0000000000001947
  3. Menghini, L., Cellini, N., Goldstone, A., Baker, F. C., & de Zambotti, M. (2021). A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code. Sleep, 44(2), zsaa170. https://doi.org/10.1093/sleep/zsaa170
  4. Roomkham, S., Lovell, D., Cheung, J., & Perrin, D. (2018). Promises and challenges in the use of consumer-grade devices for sleep monitoring. IEEE Reviews in Biomedical Engineering, 11, 53–67. https://doi.org/10.1109/RBME.2018.2811735
  5. van Hees, V. T., Sabia, S., Anderson, K. N., Denton, S. J., Oliver, J., Catt, M., Abell, J. G., Kivimäki, M., Trenell, M. I., & Singh-Manoux, A. (2015). A novel, open access method to assess sleep duration using a wrist-worn accelerometer. PLOS ONE, 10(11), e0142533. https://doi.org/10.1371/journal.pone.0142533
  6. Cole, R. J., Kripke, D. F., Gruen, W., Mullaney, D. J., & Gillin, J. C. (1992). Automatic sleep/wake identification from wrist activity. Sleep, 15(5), 461–469. https://doi.org/10.1093/sleep/15.5.461
  7. Sadeh, A., Sharkey, K. M., & Carskadon, M. A. (1994). Activity-based sleep-wake identification: an empirical test of methodological issues. Sleep, 17(3), 201–207. https://doi.org/10.1093/sleep/17.3.201
  8. Depner, C. M., Cheng, P. C., Devine, J. K., Khosla, S., de Zambotti, M., Robillard, R., Vakulin, A., & Drummond, S. P. A. (2020). Wearable technologies for developing sleep and circadian biomarkers: a summary of workshop discussions. Sleep, 43(2), zsz254. https://doi.org/10.1093/sleep/zsz254
  9. van Hees, V. T., Gorzelniak, L., Dean León, E. C., Eder, M., Pias, M., Taherian, S., Ekelund, U., Renström, F., Franks, P. W., Horsch, A., & Brage, S. (2013). Separating movement and gravity components in an acceleration signal and implications for the assessment of human daily physical activity. PLOS ONE, 8(4), e61691. https://doi.org/10.1371/journal.pone.0061691
  10. National Institutes of Health. (2025). PAR-25-170: Methodology and measurement in the behavioral and social sciences. https://grants.nih.gov/grants/guide/pa-files/PAR-25-170.html
