Section C · Coding

Python for Data

pandas, polars, when to leave SQL and reach for Python. The Python coding round for analytics engineers is usually about data manipulation, not LeetCode.

When to reach for Python (vs staying in SQL)

  • External APIs / sources — fetching from REST, scraping, vendor SDKs.
  • Custom transformations SQL can't express cleanly — text parsing, complex string ops, statistical computations.
  • Iterative algorithms — sessionization with custom logic, ML preprocessing, graph algorithms.
  • One-off analysis in a notebook — exploratory, throwaway, fast feedback.
  • Glue between systems — orchestration scripts, alerting, custom ETL steps.

Default principle: if SQL can do it cleanly, do it in SQL. SQL is more testable, more portable, more easily reviewed by analysts. Python is the fallback when SQL can't or shouldn't.
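
For example, messy free-text parsing is exactly the kind of transformation that gets ugly fast in SQL. A minimal sketch, with a hypothetical column and patterns:

text_parsing.py
import pandas as pd

# Hypothetical raw values: 'US-2024-000123 | priority=high'
df = pd.DataFrame({'order_ref': [
    'US-2024-000123 | priority=high',
    'CA-2024-000456 | priority=low',
]})

# Named-group regex extraction: one expression here, a pile of SUBSTRING calls in SQL
parts = df['order_ref'].str.extract(r'(?P<country>[A-Z]{2})-(?P<year>\d{4})-(?P<seq>\d+)')
df = pd.concat([df, parts], axis=1)
df['priority'] = df['order_ref'].str.extract(r'priority=(\w+)', expand=False)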

pandas essentials

The lingua franca. Still expected even if you prefer polars.

pandas cheat sheet
import pandas as pd

# Read / write
df = pd.read_csv('file.csv')
df = pd.read_parquet('file.parquet')
df.to_parquet('out.parquet', index=False)

# Selection
df[df['amount'] > 100]                           # filter rows
df[['col1', 'col2']]                             # select columns
df.loc[df['col'] == 'x', 'other_col'] = 'y'      # conditional update
df.query("status == 'active' and amount > 100")  # readable filter

# Aggregation
df.groupby('customer_id')['revenue'].sum()
df.groupby(['region', 'date']).agg(
    total_revenue=('revenue', 'sum'),
    order_count=('order_id', 'count'),
    avg_value=('amount', 'mean'),
)

# Joins
pd.merge(left, right, on='customer_id', how='left')
pd.merge(left, right, left_on='lid', right_on='rid', how='inner')

# Window-ish operations
df.sort_values(['customer_id', 'ordered_at'])
df.groupby('customer_id')['revenue'].cumsum()             # running total
df.groupby('customer_id')['revenue'].rolling(7).mean()    # 7-row moving avg per customer (MultiIndexed result)
df['rank'] = df.groupby('customer_id')['revenue'].rank(ascending=False)

# Reshape
df.pivot_table(index='customer_id', columns='quarter', values='revenue', aggfunc='sum')
df.melt(id_vars=['customer_id'], var_name='quarter', value_name='revenue')

# NaN handling
df.fillna(0)
df.dropna(subset=['email'])
df['col'].isna().sum()

# Date / time
df['date'] = pd.to_datetime(df['date_str'])
df['date'].dt.month
df['date'].dt.dayofweek

pandas at scale

pandas loads everything into memory; past roughly 10M rows it starts hurting. For larger data, push back into the warehouse (SQL), or switch to polars / DuckDB / Dask.
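
If you must stay in pandas, chunked reads keep memory bounded. A minimal sketch, assuming a hypothetical oversized CSV:

chunked_read.py
import pandas as pd

# Aggregate a file too big for memory, one fixed-size chunk at a time
totals = {}
for chunk in pd.read_csv('big_orders.csv', chunksize=1_000_000):
    grouped = chunk.groupby('customer_id')['revenue'].sum()
    for customer_id, revenue in grouped.items():
        totals[customer_id] = totals.get(customer_id, 0) + revenue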

polars — the modern alternative

Polars is pandas-but-faster-and-saner. Built in Rust, uses Arrow, supports lazy evaluation. Many teams are migrating to it.

polars cheat sheet
import polars as pl

df = pl.read_parquet('file.parquet')

# Filter + select + aggregate in one expression chain
result = (
    df
    .filter(pl.col('status') == 'active')
    .group_by(['customer_id'])
    .agg(
        pl.col('revenue').sum().alias('total_revenue'),
        pl.col('order_id').count().alias('order_count'),
    )
    .sort('total_revenue', descending=True)
)

# Window functions feel natural in polars
df = df.with_columns(
    pl.col('revenue').sum().over('customer_id').alias('lifetime_revenue'),
    pl.col('ordered_at').rank().over('customer_id').alias('order_rank'),
)

# Lazy evaluation — query plan optimized
lazy = pl.scan_parquet('huge_file.parquet')
result = (
    lazy
    .filter(pl.col('date') >= pl.date(2024, 1, 1))  # explicit date literal avoids string-vs-date comparison quirks
    .group_by('region')
    .agg(pl.col('revenue').sum())
    .collect()  # actually execute
)

Mention you've used or are aware of polars. Interview-grade fluency: "polars is the modern alternative; lazy evaluation and the expression API are nicer than pandas for production data work. For interactive exploration, pandas is still fine."

SQL inside Python

Often the cleanest pattern: do as much as possible in SQL, fetch the result, do final transformations in Python.

SQL → pandas
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('snowflake://...')

query = """
SELECT customer_id, SUM(revenue) AS total_revenue
FROM orders
WHERE ordered_at >= %(start_date)s
GROUP BY customer_id
"""

df = pd.read_sql(query, engine, params={'start_date': '2024-01-01'})

Always parameterize rather than interpolating values into the SQL string. For production code, skip SQLAlchemy and use the warehouse-specific driver (snowflake-connector-python, google-cloud-bigquery) directly.
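
A minimal sketch of the driver-direct pattern with snowflake-connector-python; connection details are placeholders:

snowflake_direct.py
import snowflake.connector

# Placeholder connection details: pull from a secrets manager in real code
conn = snowflake.connector.connect(
    account='...', user='...', password='...',
    warehouse='...', database='...', schema='...',
)

query = """
SELECT customer_id, SUM(revenue) AS total_revenue
FROM orders
WHERE ordered_at >= %(start_date)s
GROUP BY customer_id
"""

cur = conn.cursor()
cur.execute(query, {'start_date': '2024-01-01'})  # pyformat binding, the connector's default
df = cur.fetch_pandas_all()  # needs the pandas extra of snowflake-connector-python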

DuckDB — SQL on Python data

Reach for DuckDB when you want SQL ergonomics against pandas/polars data, or against local Parquet files. Fast.

DuckDB on a DataFrame
import duckdb
import pandas as pd

df = pd.read_csv('orders.csv')

result = duckdb.sql("""
    SELECT customer_id, SUM(revenue) AS total
    FROM df
    GROUP BY customer_id
    QUALIFY ROW_NUMBER() OVER (ORDER BY SUM(revenue) DESC) <= 10
""").df()

DuckDB queries pandas/polars DataFrames natively. Useful when you want window functions or CTEs without leaving Python.
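
It also queries Parquet straight off disk, glob patterns included. A sketch with a hypothetical path:

duckdb_parquet.py
import duckdb

# Query local Parquet files directly; no load step required
top_regions = duckdb.sql("""
    SELECT region, SUM(revenue) AS total
    FROM 'data/orders/*.parquet'
    GROUP BY region
    ORDER BY total DESC
""").df()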

Common analytics-engineering Python patterns

API ingestion

api_to_warehouse.py
import requests
import pandas as pd
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from snowflake.connector.pandas_tools import write_pandas

# Retry transient failures (rate limits, 5xx) with exponential backoff
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

# Paginated fetch: stops at the first empty page
def fetch_all_pages(url, params, max_pages=100):
    rows = []
    for page in range(1, max_pages + 1):
        r = session.get(url, params={**params, 'page': page}, timeout=30)
        r.raise_for_status()
        data = r.json()
        if not data['results']:
            break
        rows.extend(data['results'])
    return rows

rows = fetch_all_pages('https://api.example.com/v1/orders', {'since': '2024-01-01'})
df = pd.DataFrame(rows)

# Write to warehouse (snowflake_conn is an already-open snowflake.connector connection)
write_pandas(snowflake_conn, df, table_name='RAW_ORDERS')

Deduplication in pandas

dedup_pandas.py
df_sorted = df.sort_values(['email', 'updated_at'], ascending=[True, False])
df_deduped = df_sorted.drop_duplicates(subset='email', keep='first')

Sessionization (when SQL is awkward)

sessions_pandas.py
df = df.sort_values(['user_id', 'event_ts'])
df['prev_ts'] = df.groupby('user_id')['event_ts'].shift(1)
# NaT comparisons return False, so flag each user's first event explicitly
df['new_session'] = (
    df['prev_ts'].isna()
    | ((df['event_ts'] - df['prev_ts']) > pd.Timedelta(minutes=30))
)
df['session_id'] = df.groupby('user_id')['new_session'].cumsum()

Dates and times

dates.py
df['ts'] = pd.to_datetime(df['ts'], utc=True)
df['date'] = df['ts'].dt.tz_convert('America/New_York').dt.date
df['month_start'] = df['ts'].dt.to_period('M').dt.to_timestamp()
df['day_of_week'] = df['ts'].dt.dayofweek  # 0 = Monday