BayesIQ

2026-03-05

The 5 telemetry failures every product team makes

Most product analytics break in the same five ways. Here's how to spot them, measure them, and fix them before they corrupt your metrics.

Every product team we've worked with believes their telemetry is mostly correct. After running structured audits across dozens of products, we can say with confidence: it isn't. The failures are almost always the same five, and they compound each other.

This post names them, shows you how to detect them with concrete numbers, and tells you exactly what to do about each one.


The 5 failures at a glance

| Failure | What it breaks | How to detect | Fix |
| --- | --- | --- | --- |
| 1. Missing required fields | Metric queries silently drop rows | Null-rate query per event type | Enforce schema at ingestion; alert on >1% null |
| 2. Event naming drift | Funnels break; cohorts fragment | Audit distinct event_name values | Enforce a naming convention; lint against spec |
| 3. Identity stitching gaps | User-level metrics are wrong | Check user_id null rate pre/post login | Resolve anonymous → identified IDs in pipeline |
| 4. Duplicate event inflation | Conversion rates overreport | Count events per session per event type | Add deduplication key; dedupe before aggregation |
| 5. Schema type violations | Pipelines coerce silently | Inspect field value distributions | Validate types at ingestion; reject malformed events |

Failure 1 — Missing required fields

Symptom

Your purchase_completed event fires. Revenue dashboards update. Everyone assumes the data is correct. Then someone asks for revenue broken down by currency, and the query returns NULL for 22% of rows.

Root cause

A required field (currency) was added to the logging spec but the mobile implementation missed the update. The event fires — it just fires incomplete. No error is thrown. No alert fires. Dashboards that don't use the field look fine.

Detection

Run a null-rate check on every required field for your top conversion events:

SELECT
  COUNT(*) AS total_events,
  COUNTIF(user_id IS NULL) AS missing_user_id,
  COUNTIF(currency IS NULL) AS missing_currency,
  ROUND(COUNTIF(currency IS NULL) / COUNT(*) * 100, 2) AS currency_null_pct
FROM events
WHERE event_name = 'purchase_completed'
  AND event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY);

In a recent audit, this query surfaced a 22% null rate on currency across 14 days of purchase_completed events. The field had been required in the spec for 3 months. Finance had been reporting revenue by currency from the remaining 78% and assuming it was representative.

Threshold: Any required field above 1% null rate on a conversion event is a P0 finding.

Fix

  1. Identify which clients (web, iOS, Android) are sending the null field.
  2. Patch the implementation and deploy.
  3. Add a schema validation step at ingestion that rejects or quarantines events missing required fields — not just logs them.
  4. Flag the affected metric window in your dashboards so historical comparisons account for the gap.
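A minimal sketch of step 3, the reject-or-quarantine gate. The required-field sets and the `_rejection_reason` field are illustrative assumptions — in practice both come from your logging spec and pipeline conventions, not a hard-coded dict:

```python
# Hypothetical required-field sets per event type; in a real pipeline
# these are generated from the logging spec, not hard-coded.
REQUIRED_FIELDS = {
    "purchase_completed": {"user_id", "currency", "price"},
}

def validate_required(event: dict) -> tuple[bool, list[str]]:
    """Return (is_valid, missing_fields) for one event payload."""
    required = REQUIRED_FIELDS.get(event.get("event_name"), set())
    missing = sorted(f for f in required if event.get(f) is None)
    return (not missing, missing)

def ingest(events: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route valid events downstream; quarantine the rest with a reason."""
    accepted, quarantined = [], []
    for event in events:
        ok, missing = validate_required(event)
        if ok:
            accepted.append(event)
        else:
            # Quarantined, not dropped: the rejection reason makes the
            # gap auditable instead of silent.
            quarantined.append({**event, "_rejection_reason": f"missing: {missing}"})
    return accepted, quarantined
```

The key property is that an incomplete event never reaches a metrics table silently — it either passes whole or lands in a quarantine table you can alert on.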

Failure 2 — Event naming drift

Symptom

Your activation funnel shows a 12% week-over-week drop. No product change happened. The engineering team is confused. The data team assumes it's real.

Root cause

Someone renamed user_signed_up to signup_completed during a codebase refactor. Both names fire in production simultaneously for two weeks while the rollout completes. Funnel queries that filter on one name miss the other. The event didn't stop firing — the name changed, and nobody updated the downstream queries.

Detection

Audit distinct event names against your spec:

-- Events firing in production but not in the spec
SELECT
  event_name,
  COUNT(*) AS occurrences,
  MIN(event_timestamp) AS first_seen,
  MAX(event_timestamp) AS last_seen
FROM events
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY event_name
ORDER BY occurrences DESC;

Compare this list against your logging spec. Any event firing in production with no spec entry is a finding. In a representative audit, we typically find 5–8 undocumented event names firing at meaningful volume — usually renamed events, legacy events that were never cleaned up, or events added by a third-party integration.

A related failure: casing drift. Page_Viewed, page_viewed, and page-viewed are three distinct event names. An 8% match-rate drop in a funnel query was traced to exactly this: iOS sent Page_Viewed (title case) while web sent page_viewed (snake case) after a shared analytics utility was rewritten.

Fix

  1. Define and enforce a naming convention (snake_case, verb-noun, max depth).
  2. Add a linting step to your CI that validates event names against the spec before merging tracking code.
  3. When renaming events, keep the old name firing in parallel for at least one full release cycle, and update all downstream queries before sunsetting.
  4. Lock event names in version control alongside your codebase — not in a shared spreadsheet.
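Step 2's lint check can be sketched as follows. The snake_case pattern and the spec set are assumptions standing in for your own convention and spec file:

```python
import re

# Assumed convention: lowercase snake_case, two or three segments
# (e.g. page_viewed, checkout_started). Catches casing drift like
# Page_Viewed and separator drift like page-viewed.
NAME_PATTERN = re.compile(r"^[a-z]+(_[a-z]+){1,2}$")

def lint_event_names(production_names: set[str], spec_names: set[str]) -> dict:
    """Flag names that violate the convention or are missing from the spec."""
    return {
        "bad_format": sorted(n for n in production_names if not NAME_PATTERN.match(n)),
        "undocumented": sorted(production_names - spec_names),
    }
```

Run this in CI against the distinct names your tracking code emits; a non-empty result fails the build before the drift reaches production.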

Failure 3 — Identity stitching gaps

Symptom

Your activation rate looks healthy. But when you filter the same cohort by user_id, the numbers don't match. User-level funnels and session-level funnels tell different stories.

Root cause

Events fired before a user logs in use an anonymous ID (anonymous_id or device_id). Events fired after login use user_id. If the pipeline doesn't stitch these together, the user's pre-login journey is invisible at the user level. Activation funnels that require a full pre-login → post-login path silently drop users who didn't complete the first step in a logged-in session.

In a recent audit, 15–25% of activated users had pre-login events that were never linked to their user_id. Activation rate was overstated because the denominator (users who started the funnel) was undercounted — those users' anonymous events were never matched.

Detection

-- Users with post-login events but no stitched pre-login events
SELECT
  user_id,
  MIN(event_timestamp) AS first_logged_in_event,
  COUNTIF(anonymous_id IS NOT NULL) AS events_with_anon_id,
  COUNTIF(anonymous_id IS NULL) AS events_logged_in_only
FROM events
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND user_id IS NOT NULL
GROUP BY user_id
HAVING events_with_anon_id = 0
ORDER BY first_logged_in_event
LIMIT 100;

If users appear in this query with no stitched pre-login events, your identity graph is incomplete.

Fix

  1. Implement an identify call in your client SDK at the moment of login — this creates the anonymous → user_id mapping.
  2. In your pipeline, run a stitching pass that back-fills user_id onto historical anonymous events for each user, using the earliest known mapping.
  3. Validate stitching coverage: the percentage of user_id-keyed sessions that have at least one pre-login anonymous predecessor should match your expected signup-from-anonymous rate.
  4. If you're using a third-party SDK (Segment, Amplitude, Mixpanel), verify the alias or identify call is being made correctly — it's frequently misconfigured.
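Step 2's stitching pass, sketched over in-memory events. The field names mirror the queries above, but the earliest-mapping logic is an illustrative assumption — your warehouse version would do the same join in SQL:

```python
def build_identity_map(events: list[dict]) -> dict[str, str]:
    """Map each anonymous_id to the user_id from its earliest linking event."""
    mapping: dict[str, str] = {}
    for ev in sorted(events, key=lambda e: e["event_timestamp"]):
        anon, user = ev.get("anonymous_id"), ev.get("user_id")
        if anon and user and anon not in mapping:
            mapping[anon] = user  # keep the earliest known link
    return mapping

def stitch(events: list[dict]) -> list[dict]:
    """Back-fill user_id onto anonymous pre-login events."""
    mapping = build_identity_map(events)
    return [
        {**ev, "user_id": ev.get("user_id") or mapping.get(ev.get("anonymous_id"))}
        for ev in events
    ]
```

After this pass, a pre-login page_viewed carries the same user_id as the post-login events, so user-level funnels see the full journey.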

Failure 4 — Duplicate event inflation

Symptom

Conversion rate is higher than it should be. Engineering says they haven't changed anything. The numbers look great in the dashboard, but don't match what finance reports from the source-of-truth system.

Root cause

Events are being counted multiple times. Common causes: SDK retry logic fires the event again after a network timeout; a page reload on a checkout flow re-submits the form; client-side events don't have a deduplication key, so the pipeline ingests every copy.

In a recent audit, checkout_started had an 11% duplicate rate — 1 in 9 "checkout starts" was a duplicate event from the same session. Session-to-checkout conversion, computed from event counts rather than distinct sessions, looked 9 points higher than reality.

Detection

-- Sessions with more than one occurrence of a supposedly unique event
SELECT
  session_id,
  event_name,
  COUNT(*) AS event_count
FROM events
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND event_name IN ('signup_completed', 'checkout_started', 'purchase_completed')
GROUP BY session_id, event_name
HAVING COUNT(*) > 1
ORDER BY event_count DESC
LIMIT 50;

Cross-reference with business logic: some events legitimately fire multiple times per session (e.g., page_viewed). The ones that shouldn't — signup_completed, purchase_completed, subscription_started — are the ones to focus on.

Fix

  1. Add a message_id or event_id field to every event payload — a UUID generated client-side at the moment the event fires.
  2. Deduplicate on message_id at ingestion, before events reach any downstream table.
  3. For events that don't yet have a deduplication key, use a composite of (session_id, event_name, timestamp_truncated_to_minute) as a temporary deduplication proxy while you add proper message_id support.
  4. After deduplication is in place, retroactively compute the corrected metrics for the affected historical window and document the adjustment.
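Steps 2 and 3 can be sketched together — dedupe on message_id where it exists, falling back to the composite key for events that don't have one yet. Field names are illustrative:

```python
from datetime import datetime, timezone

def dedup_key(event: dict) -> tuple:
    """Prefer message_id; fall back to (session_id, event_name, minute bucket)."""
    if event.get("message_id"):
        return ("msg", event["message_id"])
    # Composite proxy: truncate the timestamp to the minute so an SDK
    # retry a few seconds later collapses onto the same key.
    ts = datetime.fromtimestamp(event["event_timestamp"], tz=timezone.utc)
    minute = ts.replace(second=0, microsecond=0)
    return ("composite", event["session_id"], event["event_name"], minute)

def deduplicate(events: list[dict]) -> list[dict]:
    """Keep the first occurrence of each key, preserving input order."""
    seen, out = set(), []
    for ev in events:
        key = dedup_key(ev)
        if key not in seen:
            seen.add(key)
            out.append(ev)
    return out
```

Note the trade-off in the fallback: two legitimate events in the same minute would be collapsed, which is why it's a temporary proxy rather than a replacement for message_id.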

Failure 5 — Schema type violations

Symptom

Your revenue metrics don't add up. A query that sums price returns a number that's off by a factor of 100 on some rows. Or your pipeline silently truncates decimal values because it coerces the string "19.99" to the integer 19.

Root cause

The client sends a field as the wrong type. The pipeline doesn't reject it — it coerces, truncates, or silently drops it. Common examples:

  • price sent as a string ("29.99") instead of a float (29.99) — coercion to an integer truncates the decimals.
  • timestamp sent in local timezone instead of UTC — time-series aggregations are off by hours.
  • user_id populated with an anonymous session ID — joins fail silently.
  • plan_type contains "pro", "Pro", "PRO", and "professional" — all intended to mean the same thing.

In one audit, a coercion issue in a revenue pipeline had been accumulating for 6 months: price was sent as a string from one client platform. The pipeline cast it to INT64, truncating all decimal values. The underreporting was approximately 3% of total revenue — small enough to be within normal variance, large enough to matter.

Detection

-- Check for non-numeric values in a price field
SELECT price, COUNT(*) AS occurrences
FROM events
WHERE event_name = 'subscription_started'
  AND NOT REGEXP_CONTAINS(CAST(price AS STRING), r'^\d+(\.\d{1,2})?$')
GROUP BY price
ORDER BY occurrences DESC
LIMIT 50;

For enum fields, audit distinct values:

SELECT plan_type, COUNT(*) AS occurrences
FROM events
WHERE event_name = 'subscription_started'
  AND event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY plan_type
ORDER BY occurrences DESC;

Unexpected values in the result set are findings.

Fix

  1. Define explicit types for every field in your logging spec — not just names and descriptions.
  2. Add a schema validation step at ingestion that rejects events with type violations rather than coercing them. Route rejected events to a quarantine table for inspection.
  3. For enum fields, define the allowed value set in the spec and enforce it at the client level (not just the pipeline level) using a shared constants file.
  4. Audit the last 90 days of affected fields retroactively to quantify the impact of any historical coercion, and document the corrected value.
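Steps 1–3 combined into a minimal sketch. The type map and the allowed enum values are assumptions standing in for your spec's actual entries:

```python
# Illustrative spec entry: explicit types plus an allowed-value set for enums.
SPEC = {
    "subscription_started": {
        "types": {"price": float, "plan_type": str},
        "enums": {"plan_type": {"free", "pro", "enterprise"}},
    },
}

def validate_types(event: dict) -> list[str]:
    """Return a list of violations; an empty list means the event is clean."""
    spec = SPEC.get(event.get("event_name"), {})
    violations = []
    for field, expected in spec.get("types", {}).items():
        value = event.get(field)
        # Reject, don't coerce: the string "29.99" is a violation even
        # though a cast would appear to succeed.
        if not isinstance(value, expected):
            violations.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for field, allowed in spec.get("enums", {}).items():
        if event.get(field) not in allowed:
            violations.append(f"{field}: {event.get(field)!r} not in allowed set")
    return violations
```

An event with any violations goes to the quarantine table with the violation list attached, which is what makes the 6-month silent-truncation scenario above detectable on day one.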

Telemetry validation checklist

Use this before any major product launch, A/B test instrumentation, or analytics refactor. It's also a reasonable quarterly health check.

Event naming and versioning

  • All event names follow the agreed convention (e.g., snake_case, verb-noun pattern)
  • No undocumented events firing in production (spec vs. production comparison is clean)
  • Event renames are deployed with a parallel-fire period and downstream queries updated

Required fields

  • Null rate on every required field is below 1% for all conversion events
  • Required fields are validated at the client before the event fires (not just at ingestion)
  • New fields added to the spec are deployed across all client platforms simultaneously

Identity stitching

  • identify / alias calls fire at the correct moment (login, signup, account merge)
  • Pipeline stitching pass back-fills user_id onto pre-login anonymous events
  • Stitching coverage is measured and above expected threshold

Timestamps and timezones

  • All timestamps are sent in UTC
  • Server-side timestamps are used for server-side events; client timestamps are used only where necessary and labeled as such
  • Late-arrival rate (events arriving >24 hours after event_timestamp) is below 5%

Deduplication

  • Every event payload includes a message_id (UUID) generated at fire time
  • Deduplication runs on message_id before events reach downstream tables
  • Duplicate rates for conversion events are measured and below 0.5%

Schema and type validation

  • Field types are explicitly defined in the logging spec
  • Type violations are rejected at ingestion (not silently coerced)
  • Enum fields have an explicit allowed-value list enforced at the client

Schema drift monitoring

  • Null-rate alerts are configured for required fields on conversion events (threshold: >1%)
  • New event names that don't match the naming convention trigger a CI lint failure
  • A schema versioning or changelog mechanism exists so drift is detectable over time

Why these five keep recurring

These failures aren't caused by careless teams. They happen because analytics instrumentation sits at the boundary between product engineering and data engineering, and that boundary is almost never owned clearly. Product engineers add events to ship features. Data engineers build metrics on top of those events. Neither team has full visibility into the other's assumptions.

The result: the spec drifts from reality, the pipeline compensates silently, and the dashboards look plausible until someone asks a question the data can't actually answer.

The fix isn't a new tool. It's process: treat your logging spec like a schema, validate it like a contract, and test it like code.


Get a telemetry audit

If you recognize more than two of these failures in your current setup, the underlying data is probably worse than you think. These issues compound — a duplicate event with a missing required field and an identity stitching gap means the metric is wrong in three independent ways simultaneously.

Here's what a BayesIQ telemetry audit looks like:

  • Timeline: 1–2 weeks, primarily async
  • What we do: Systematically validate your telemetry against your logging spec, run null-rate and duplicate-rate analysis, audit your identity stitching, check schema types, and compare your pipeline output to raw event counts
  • What you get: A severity-ranked findings report (P0–P3) with root cause analysis, a concrete fix plan for each issue, and a remediation timeline your engineering team can execute

We typically find P0 issues — metrics that are materially wrong right now — within the first 48 hours.

Book a telemetry audit →

If you'd rather start with a conversation, send us your logging spec or describe what you're seeing. We'll tell you whether it warrants a full audit or whether there's a quick answer.

Get in touch →