Section B · Core DS

Time-Series & Signals

The signal-processing craft Archetype's role is built around — preprocessing, time-series features, alignment, filtering, frequency-domain views, and video-data prep.

Why this matters at Archetype

The Archetype JD names this skill explicitly: "Hands-on experience with raw time-series sensor data and/or video," "Iterative data preprocessing cycle execution with baseline comparison," "Time-series and signal visualization." The work is taking raw sensor traces (often messy, noisy, asynchronous across sources) and producing features Newton can reason about.

Preprocessing & cleaning

Common defects in raw sensor data

  • Missing timestamps: sensor dropped a packet.
  • Duplicate timestamps: sensor sent the same reading twice.
  • Clock drift: sensor's clock vs server clock differ by hundreds of ms.
  • Saturation: values clipped at the sensor's max.
  • Stuck values: same reading for many samples (sensor failure).
  • Spikes: single-sample outliers from electrical noise.
  • Drift: gradual sensor calibration loss.

Cleaning recipe

  1. Deduplicate on timestamp + sensor_id.
  2. Sort by timestamp.
  3. Identify and fill gaps explicitly (with NaN or interpolation, with a flag).
  4. Cap or remove outliers (winsorize, clip, or robust z-score).
  5. Detect "stuck sensor" runs and flag them.
  6. Resample to a uniform rate if downstream models require it.
Basic sensor cleaning

import pandas as pd
import numpy as np

def clean_sensor(df: pd.DataFrame, value_col: str, ts_col: str) -> pd.DataFrame:
    # 0. ensure real datetimes so the index operations below work
    df = df.copy()
    df[ts_col] = pd.to_datetime(df[ts_col])
    # 1. dedupe + sort (add sensor_id to the key if the frame mixes sensors)
    df = df.drop_duplicates([ts_col]).sort_values(ts_col).reset_index(drop=True)
    # 2. flag stuck runs: 9+ zero diffs within a 10-sample window
    df['stuck'] = (df[value_col].diff() == 0).rolling(window=10).sum() >= 9
    # 3. winsorize spikes (robust to heavy tails)
    q_low, q_high = df[value_col].quantile([0.001, 0.999])
    df[value_col] = df[value_col].clip(q_low, q_high)
    # 4. uniform 1 Hz grid: resample, not asfreq (asfreq keeps only samples
    #    that land exactly on the grid, discarding irregular readings),
    #    then forward-fill short gaps only
    df = (df.set_index(ts_col)
            .resample('1s').last()
            .ffill(limit=3)
            .reset_index())
    return df

Time-series features

For supervised modeling over sensor/time-series data, manual feature extraction is often more interpretable and reliable than end-to-end deep learning. The standard set:

  • Aggregates over windows: mean, median, std, min, max, quantiles. Over fixed-length windows or trailing windows.
  • Counts and rates: number of events above threshold per minute; rate of change.
  • Lag features: value at t−1, t−5, t−60.
  • Differences and ratios: short-window vs long-window mean ratio (a practical form of trend detection).
  • Cross-sensor: correlation between two sensors over a window, lag of one against another.
  • Frequency-domain: dominant frequency, spectral energy in specific bands.
  • Statistical: skewness, kurtosis, autocorrelation at specified lags.

tsfresh, sktime, and similar libraries auto-generate hundreds of features. Useful for baselines; the staff-bar move is then identifying which 10–20 features actually drive the signal and discarding the rest.
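As a concrete sketch, most of the window and lag features above fall out of pandas rolling operations directly (synthetic series; the window sizes and column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=300))   # synthetic 1 Hz trace

feats = pd.DataFrame({
    # trailing-window aggregates
    'mean_60': s.rolling(60).mean(),
    'std_60': s.rolling(60).std(),
    'q90_60': s.rolling(60).quantile(0.9),
    # lag features
    'lag_1': s.shift(1),
    'lag_5': s.shift(5),
    # short/long window mean ratio as a crude trend signal
    'trend_10_60': s.rolling(10).mean() / s.rolling(60).mean(),
})
```

The leading rows are NaN until each window fills; drop or mask them before modeling.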

Alignment & resampling

When you have multiple sensors with different rates and clocks, alignment is the first hard problem:

  • Pick a reference clock (one sensor or the server clock) and align everything to it.
  • Resample to a common rate. Upsample with interpolation (linear, cubic, or "as-of merge"); downsample with aggregation (mean, max, last).
  • Handle clock drift: if clocks drift, align by event landmarks (cross-correlate to find best offset) rather than naive timestamp.
  • Watch interpolation choice: interpolating a categorical sensor reading is meaningless; use forward-fill (last observation carried forward) instead.
Asof merge

For "what was sensor B's reading at the time of each sensor A event," pd.merge_asof is the right tool — much cleaner than reindex + interpolate. Specify direction='backward' and a tolerance to bound staleness.
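A minimal sketch of that pattern, with hypothetical column names and a 2 s staleness bound:

```python
import pandas as pd

# hypothetical streams: irregular sensor A events, 1 Hz sensor B readings
a = pd.DataFrame({'ts': pd.to_datetime(['2024-01-01 00:00:01.3',
                                        '2024-01-01 00:00:04.7'])})
b = pd.DataFrame({'ts': pd.date_range('2024-01-01', periods=10, freq='1s'),
                  'b_val': range(10)})

# latest B reading at each A event, at most 2 s stale;
# both frames must already be sorted on the key
merged = pd.merge_asof(a, b, on='ts',
                       direction='backward',
                       tolerance=pd.Timedelta('2s'))
```

Events with no B reading inside the tolerance come back as NaN, which is the honest answer — better than silently interpolating across a gap.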

Filtering & denoising

Three families to know:

  • Moving average (boxcar): simple, introduces lag. Median filter is more robust to outliers.
  • Exponentially weighted moving average: weights recent samples more. Familiar from finance and IoT.
  • Butterworth low-pass / high-pass / band-pass: frequency-domain filtering with sharper cutoffs. scipy.signal.butter + filtfilt is the standard.
Butterworth low-pass
from scipy.signal import butter, filtfilt

def low_pass(signal, fs, cutoff_hz, order=4):
    nyq = 0.5 * fs
    b, a = butter(order, cutoff_hz / nyq, btype='low')
    return filtfilt(b, a, signal)  # zero-phase

Use filtfilt (forward-backward filtering) to avoid introducing phase delay. The cost is non-causality — you're filtering with future samples. Fine for offline; for online prediction, use a regular causal filter.
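To make the trade-off concrete, here's a sketch comparing scipy.signal.lfilter (causal) with filtfilt on a synthetic 2 Hz signal plus 30 Hz noise; the causal output carries the filter's phase lag, the zero-phase output does not:

```python
import numpy as np
from scipy.signal import butter, filtfilt, lfilter

fs = 100.0                                    # sample rate, Hz
t = np.arange(0, 2, 1 / fs)
clean = np.sin(2 * np.pi * 2 * t)             # 2 Hz signal of interest
x = clean + 0.3 * np.sin(2 * np.pi * 30 * t)  # plus 30 Hz noise

b, a = butter(4, 5 / (0.5 * fs), btype='low')  # 5 Hz cutoff
y_causal = lfilter(b, a, x)   # usable online; carries phase lag
y_zero = filtfilt(b, a, x)    # zero phase; needs future samples
```

Plotting both against `clean` shows the causal trace shifted right; any offline metric computed on it inherits that shift.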

Frequency-domain

For periodic or quasi-periodic signals, the frequency domain often surfaces structure that's invisible in time. scipy.signal.welch for power spectral density; scipy.signal.spectrogram for time-frequency views.
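A sketch of pulling a dominant frequency and a band-energy feature out of welch, on a synthetic quasi-periodic trace (the 1–3 Hz band is illustrative):

```python
import numpy as np
from scipy.signal import welch

fs = 50.0
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(0)
# 2 Hz oscillation buried in noise
x = np.sin(2 * np.pi * 2 * t) + 0.5 * rng.normal(size=t.size)

f, pxx = welch(x, fs=fs, nperseg=256)   # power spectral density
dominant_hz = f[np.argmax(pxx)]
band = (f >= 1) & (f <= 3)
band_energy = pxx[band].sum() * (f[1] - f[0])   # crude band integral
```

`nperseg` trades frequency resolution against variance of the estimate; 256 samples at 50 Hz gives ~0.2 Hz bins.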

When this matters:

  • Motor / equipment health (specific frequencies indicate specific wear).
  • Activity recognition from accelerometer / gyroscope.
  • Network traffic patterns.
  • Anything with seasonality at multiple timescales.

Video data prep

For video, the equivalent of "clean the time series" is:

  • Frame extraction at the right rate. Newton or downstream models may not use every frame; pick a sample rate that captures the dynamics without wasting compute (3–10 Hz is common for slowly changing scenes; 30 Hz for fast motion).
  • Resolution / cropping. Most pretrained vision models expect specific input sizes. Crop to regions of interest if the customer cares about a specific zone.
  • Annotation handling. Bounding boxes, segmentation masks, action labels — track formats and timestamps consistently. Encord and similar tools manage this.
  • Synchronization with sensor streams. If video has a timestamp and sensors do too, align both to a single clock; verify with a known event visible in both streams.
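The frame-rate decision reduces to index arithmetic. A sketch (hypothetical helper, pure Python; actual frame decoding via OpenCV or ffmpeg is a separate step) that picks which source frames to keep for a target rate, assuming a constant source frame rate:

```python
def frames_to_keep(src_fps: float, target_fps: float, n_frames: int) -> list[int]:
    """Pick source-frame indices approximating target_fps.

    Walks the source timeline in steps of src_fps / target_fps and keeps
    the nearest frame, so non-integer ratios don't accumulate drift.
    """
    if target_fps >= src_fps:
        return list(range(n_frames))   # can't upsample by skipping frames
    step = src_fps / target_fps        # source frames per kept frame
    kept = {round(i * step) for i in range(int(n_frames / step) + 1)}
    return sorted(i for i in kept if i < n_frames)
```

For example, a 2 s clip at 30 fps sampled down to 5 fps keeps every 6th frame.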

Interview probes

Show probe 1: "Customer sends video with a 24h sensor stream. Walk me through prep."

"(1) Sample video at the rate that captures the dynamics — 5 Hz for slow scenes, 30 for fast. (2) Align clocks: find a shared event visible in both streams (someone walking past, a known input change) and use cross-correlation to find offset. (3) Resample sensors to a uniform rate that matches the video. (4) Clean sensors — dedupe, sort, winsorize, flag stuck runs. (5) Build a small eval set with the customer's expected outputs. (6) Then run Newton on a small slice and iterate prompts."

Show probe 2: "What features would you extract from a 10 Hz accelerometer trace for activity classification?"

Over rolling windows: mean, std, min, max, median, IQR per axis. Magnitude (root-sum-square across axes), and its moments. Cross-axis correlations. Frequency-domain: dominant frequency, spectral energy in walking-rate band (1–3 Hz) vs running-rate band (2–4 Hz). Zero-crossing rate. tsfresh or sktime can auto-generate; then I'd prune to the 15–20 most predictive via feature importance.

Show probe 3: "Two sensors have clock drift. How do you align them?"

Pick a shared event landmark visible in both streams — a known input, a manual marker, a synchronized pulse if available. Cross-correlate the streams in a window around the landmark to find the best alignment offset. Apply that offset (or a slowly-varying one if drift accumulates). Verify on a second landmark to confirm.
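A sketch of the cross-correlation step, assuming both streams have already been resampled to a common rate (sign convention: a positive result means stream a lags stream b):

```python
import numpy as np

def estimate_offset_s(a: np.ndarray, b: np.ndarray, fs: float) -> float:
    """Offset of stream a relative to stream b, in seconds.

    Positive result: a lags b, i.e. shifting b later by the returned
    amount lines it up with a. Assumes both sampled at fs.
    """
    a = a - a.mean()                   # remove DC before correlating
    b = b - b.mean()
    corr = np.correlate(a, b, mode='full')
    lag = int(np.argmax(corr)) - (len(b) - 1)
    return lag / fs
```

In practice run this on a window around the landmark, not the whole day, so slow drift doesn't smear the peak; re-estimate per window if drift accumulates.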

Show probe 4: "What's the difference between filtfilt and a regular filter?"

filtfilt applies the filter forward and backward, canceling phase delay — useful offline. A causal filter only uses past samples, introducing some phase delay — required for real-time prediction where you can't wait for future samples. The wrong choice in the wrong setting either looks great offline and breaks in production, or introduces lag in offline analysis that distorts results.

Show probe 5: "When would you reach for frequency-domain features?"

When the signal has periodic or quasi-periodic structure — motor health, activity recognition, network traffic, anything with seasonality at meaningful timescales. Time-domain features can capture energy and bursts; frequency-domain captures structure that produces those bursts. Both together is usually best for sensor classification.