Not All Wearable Metrics Are Created Equal: A Three-Tier Framework for Clinical Use

May 9, 2026

The wearable health industry has spent the last decade chasing a single question: are these devices accurate enough to matter? After hundreds of validation studies and several landmark clinical trials, the honest answer is that it depends on the metric. Heart rate at rest is nearly clinical-grade. Calorie counts are essentially noise. Treating all wearable data with the same level of trust is the single biggest mistake clinicians, researchers, and product teams continue to make.

The framework below — drawn from our deeper analysis of the wearable validation literature — collapses years of meta-analyses into three buckets defined by what the metric is good for, not by what the marketing claims.

Tier 1: Clinical Endpoints

The first tier is reserved for metrics that have crossed the threshold from “interesting consumer data” to defensible clinical input. Two metrics qualify today: AFib screening in low-risk adults and resting heart rate.

The evidence here is the strongest in the entire wearable literature. The Apple Heart Study (N=419,297) produced an 84% positive predictive value for irregular-pulse notifications, and both Apple and Fitbit hold FDA clearance for atrial fibrillation notification. Resting heart rate accuracy is similarly strong — the 2025 living meta-analysis of Apple Watch reported a mean bias near −0.12 bpm versus ECG, well inside the AAMI ±5 bpm standard, and Chevance’s pooled analysis of 52 Fitbit studies landed at −2.99 bpm.
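As a rough illustration of that comparison, the sketch below checks the two reported mean biases against the ±5 bpm band. The function and variable names are ours, invented for this example. It looks at mean bias only, so it says nothing about the spread of individual errors.

```python
AAMI_TOLERANCE_BPM = 5.0  # the +/-5 bpm standard cited in the text


def within_tolerance(mean_bias_bpm: float,
                     tolerance_bpm: float = AAMI_TOLERANCE_BPM) -> bool:
    """Return True if the absolute mean bias sits inside the tolerance band."""
    return abs(mean_bias_bpm) <= tolerance_bpm


# Mean biases versus ECG as quoted in the paragraph above.
reported_bias_bpm = {
    "Apple Watch (2025 living meta-analysis)": -0.12,
    "Fitbit (Chevance pooled analysis)": -2.99,
}

for source, bias in reported_bias_bpm.items():
    verdict = "inside" if within_tolerance(bias) else "outside"
    print(f"{source}: {bias:+.2f} bpm, {verdict} the +/-5 bpm band")
```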

What makes these “Tier 1” is that the metrics support absolute clinical thresholding. A resting heart rate of 95 bpm means something on its own. An irregular-pulse notification triggers a defined clinical workflow. The number doesn’t need a longitudinal baseline to be useful — it can drive a decision today.

Important caveat: “low-risk adults” is doing real work in that phrase. A 2025 multicenter Heart Rhythm O2 study found Apple Watch missed roughly 1 in 3 AF episodes when non-sinus, non-AF rhythms were included in the population. Tier 1 status is conditional on the population matching the validation cohort.

Tier 2: Intra-Individual Tracking

The second tier covers metrics where the absolute number carries meaningful bias, but the delta — change in the same person over time — is medically relevant. VO2max and sleep stages sit here.

Wearable VO2max correlates well with cardiopulmonary exercise testing — the INTERLIVE consortium meta-analysis found a pooled bias of just −0.09 ml·kg⁻¹·min⁻¹ for exercise-based algorithms. But the limits of agreement span ±9.83 ml·kg⁻¹·min⁻¹, and in trained athletes the Garmin Forerunner 245 underestimated by 4–5 ml·kg⁻¹·min⁻¹. A single Garmin VO2max reading is not a substitute for CPET. A six-month trend in that same person’s VO2max is a legitimate signal worth acting on.
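The gap between a near-zero pooled bias and wide limits of agreement is easier to see with numbers. Here is a minimal Bland-Altman sketch, assuming the usual definition of the limits as bias ± 1.96 standard deviations of the paired differences. The paired values are hypothetical and were made up for this example; they are not taken from any study.

```python
from statistics import mean, stdev


def bland_altman(device, reference):
    """Return (bias, lower_limit, upper_limit) for paired measurements."""
    if len(device) != len(reference) or len(device) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample standard deviation
    return bias, bias - half_width, bias + half_width


# Hypothetical VO2max pairs in ml/kg/min (illustration only).
wearable = [42.0, 38.5, 51.0, 47.5, 35.0, 44.0]
cpet = [45.0, 36.0, 49.5, 52.0, 33.5, 43.0]

bias, lower, upper = bland_altman(wearable, cpet)
print(f"bias {bias:+.2f}, limits of agreement {lower:+.2f} to {upper:+.2f}")
```

In this toy data the individual errors mostly cancel, so the bias is close to zero even though single readings miss by several units in either direction. That is the same pattern the meta-analysis describes.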

Sleep staging follows the same pattern. The 2024 Brigham and Women’s polysomnography study found Oura, Fitbit, and Apple Watch all hit ≥95% sensitivity for sleep-vs-wake — but four-stage classification dropped to Cohen’s κ between 0.55 and 0.65, and Apple Watch overestimated light sleep by 45 minutes on average. You should not diagnose a sleep disorder from a wearable’s deep-sleep number. You can absolutely use that number to flag a meaningful change in someone’s sleep architecture over weeks or months.
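For readers less familiar with Cohen's κ, the sketch below computes it from a confusion matrix of epoch-level stage labels. The matrix is hypothetical and was constructed for illustration, with rows as polysomnography labels and columns as device labels. It is not data from the Brigham and Women's study.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: reference, cols: device)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals for each class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical 30-second epoch counts; class order: wake, light, deep, REM.
epochs = [
    [80, 15, 0, 5],
    [20, 320, 30, 30],
    [0, 45, 100, 5],
    [5, 40, 5, 100],
]

kappa = cohens_kappa(epochs)
print(f"kappa = {kappa:.2f}")
```

Raw agreement in this example is 75%, but κ comes out lower because it discounts the agreement expected by chance, which is substantial when light sleep dominates the night.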

The framing for Tier 2 metrics in any clinical or research context: report deltas, not absolutes.
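One way to put "deltas, not absolutes" into practice is to report each reading relative to the person's own baseline. The sketch below is a minimal version, assuming a baseline defined as the mean of the first few readings. The window size, the readings, and the 4.5-unit offset are hypothetical choices for illustration.

```python
from statistics import mean


def delta_from_baseline(readings, baseline_n=4):
    """Return later readings minus the mean of the first baseline_n readings."""
    if len(readings) <= baseline_n:
        raise ValueError("need more readings than the baseline window")
    baseline = mean(readings[:baseline_n])
    return [round(r - baseline, 2) for r in readings[baseline_n:]]


# Hypothetical VO2max estimates for one person (ml/kg/min).
vo2max = [41.0, 41.5, 40.5, 41.0, 42.0, 42.5, 43.0]
deltas = delta_from_baseline(vo2max)

# The same series with a constant device underestimate of 4.5 units.
vo2max_biased = [r - 4.5 for r in vo2max]
deltas_biased = delta_from_baseline(vo2max_biased)

print(deltas)         # change relative to personal baseline
print(deltas_biased)  # identical: a constant offset cancels out
```

A constant bias drops out of the delta entirely, which is why a trend can be trusted more than any single value. Bias that drifts over time, for example after a firmware or algorithm update, would not cancel this way.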

Tier 3: Behavioral Signals

The third tier is for metrics that are most useful for behavior change, motivation, and ranking individuals within a cohort — not for clinical thresholding or even reliable individual tracking. Step counting and active heart rate are the canonical examples.

Step counts are accurate enough for healthy adults walking at typical cadences. Apple Watch mean absolute percentage error (MAPE) for steps was 1.83% in the 2025 meta-analysis; Fitbit Charge MAPE stayed under 25% across 20 studies. That’s good enough to motivate someone to hit 10,000 steps, to compare a wellness cohort against itself, or to detect a major drop in baseline activity.
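The MAPE figures above average each trial's absolute error as a percentage of the reference count. A minimal sketch of that calculation follows. The step counts are hypothetical, standing in for device readings compared against hand-counted steps.

```python
def mape(measured, reference):
    """Mean absolute percentage error of measured values against a reference."""
    if len(measured) != len(reference) or not measured:
        raise ValueError("need two non-empty series of equal length")
    if any(r == 0 for r in reference):
        raise ValueError("reference values must be non-zero")
    errors = [abs(m - r) / abs(r) for m, r in zip(measured, reference)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical trials: device step count versus manually counted steps.
device_steps = [980, 2050, 4900, 10150]
counted_steps = [1000, 2000, 5000, 10000]

step_mape = mape(device_steps, counted_steps)
print(f"step MAPE = {step_mape:.2f}%")
```

Because MAPE takes absolute values, over- and undercounts do not cancel. A device can have near-zero mean bias and still post a large MAPE.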

Where this tier breaks down is in special populations. Stroke survivors walking under 0.35 m/s saw the Garmin Vivofit undercount by 68.2% on the paretic wrist. People using rollators or walkers showed 31% wrist error versus 1.5% at the ankle, because consumer accelerometers measure wrist swing, not foot strikes. Pediatric step counts diverged 35% from research-grade actigraphy. Active heart rate has its own failure mode: it degrades sharply during vigorous exercise, and the Garmin Forerunner 225 hit 24% MAPE during treadmill running.

The honest use of Tier 3 metrics is to drive behavior change interventions and rank individuals within a population, while flagging clearly that absolute values carry heavy caveats for slow walkers, children, and anyone using mobility aids.

What Gets Left Out

One omission from the framework is worth naming: energy expenditure. Calorie estimates didn’t make the chart because they don’t make any tier. Every major review puts wearable calorie MAPE above 30%, with some Apple Watch comparisons ranging from 30% to 155% versus metabolic carts. There is no consumer device validated as accurate for calories burned. Despite their prominence in weight-loss apps and activity rings, calorie counts should be treated as a motivational fiction, not a measurement.

Consumer SpO2 in darkly pigmented skin sits in a similar place: it is biased by about 1.27 percentage points and is not appropriate for clinical decisions, a gap that prompted the FDA’s January 2025 draft guidance requiring testing across diverse Monk Skin Tone categories.
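For teams that want to apply the tiers in a data pipeline, the whole framework fits in a small lookup table. The sketch below is one hypothetical encoding. The metric keys, enum names, and guidance strings are our own shorthand for the tiers described in this post, not an established schema.

```python
from enum import Enum


class Tier(Enum):
    CLINICAL_ENDPOINT = 1
    INTRA_INDIVIDUAL = 2
    BEHAVIORAL = 3
    UNTIERED = 0  # metrics the framework leaves out entirely


# Metric keys are illustrative names, not a standard vocabulary.
METRIC_TIERS = {
    "afib_screening_low_risk_adults": Tier.CLINICAL_ENDPOINT,
    "resting_heart_rate": Tier.CLINICAL_ENDPOINT,
    "vo2max": Tier.INTRA_INDIVIDUAL,
    "sleep_stages": Tier.INTRA_INDIVIDUAL,
    "step_count": Tier.BEHAVIORAL,
    "active_heart_rate": Tier.BEHAVIORAL,
    "energy_expenditure": Tier.UNTIERED,
    "spo2_darkly_pigmented_skin": Tier.UNTIERED,
}

GUIDANCE = {
    Tier.CLINICAL_ENDPOINT: "absolute thresholds, if the population matches the validation cohort",
    Tier.INTRA_INDIVIDUAL: "report deltas within one person, not absolutes",
    Tier.BEHAVIORAL: "behavior change and within-cohort ranking only",
    Tier.UNTIERED: "do not use as a measurement",
}


def guidance_for(metric: str) -> str:
    """Look up a metric's tier; raises KeyError for metrics not classified here."""
    tier = METRIC_TIERS[metric]
    return f"{metric}: {tier.name} -> {GUIDANCE[tier]}"


for name in METRIC_TIERS:
    print(guidance_for(name))
```

Raising on unknown metrics is deliberate in this sketch. A metric that has not been assigned a tier should force a decision instead of silently inheriting a default level of trust.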

Why This Framework Matters

The wearable industry has moved from “interesting but unreliable” to “conditionally clinical-grade.” The conditions are the entire game. A clinician who treats a wearable AFib alert the same way they treat a wearable calorie count is going to make bad calls in both directions — over-reacting to noise, and under-reacting to signal.

For product teams, the framework defines what claims you can defend. For researchers designing digital health trials, it tells you which endpoints will survive peer review. For clinicians, it tells you which numbers in front of you are doing real work and which ones are decoration.

The next decade of wearable health won’t be won by devices that measure more things. It will be won by devices — and the teams deploying them — that are honest about which tier each measurement actually belongs in.

Citations

Chevance, G., et al. (2022). Accuracy and precision of energy expenditure, heart rate, and steps measured by combined-sensing Fitbits against reference measures: Systematic review and meta-analysis. JMIR mHealth and uHealth.
Choe, S. & Kang, M. (2025). Living meta-analysis of Apple Watch accuracy for heart rate and step counting.
Molina-Garcia, P., et al. (2022). A systematic review on biological, social and environmental determinants of wearable VO2max validity. INTERLIVE consortium.
Fuller, D., et al. (2020). Reliability and validity of commercially available wearable devices for measuring steps, energy expenditure, and heart rate: Systematic review. JMIR mHealth and uHealth.
Germini, F., et al. (2022). Accuracy and acceptability of wrist-wearable activity-tracking devices: Systematic review. Journal of Medical Internet Research.
Perez, M.V., et al. (2019). Large-scale assessment of a smartwatch to identify atrial fibrillation. New England Journal of Medicine (Apple Heart Study).
Lubitz, S.A., et al. Fitbit Heart Study.
BASEL Wearable Study (N=201). Sensitivity/specificity of Apple Watch 6, Samsung Galaxy Watch 3, Fitbit Sense, and AliveCor KardiaMobile for AF detection.
Heart Rhythm O2 (2025) multicenter study on Apple Watch AF sensitivity in non-sinus, non-AF rhythms.
Brigham and Women’s Hospital (2024). Single-night polysomnography validation of Oura Ring Gen3, Fitbit Sense 2, and Apple Watch Series 8.
JMIR (2024) meta-analysis of four wearable PR studies (140,771 paired PPG-ECG observations) on skin pigmentation bias.
FDA (January 2025). Draft guidance on pulse oximeter testing across diverse skin tones.
