Query SourceMedium-hosted BigQuery data safely with auditable SQL receipts.

Quick Start (Copy/Paste)

Copy the block below and paste it into your coding agent to install and verify setup automatically.
Install the SourceMedium BigQuery analyst skill:

1. Run: npx skills add source-medium/skills --skill sm-bigquery-analyst
2. Read the installed SKILL.md to understand the workflow
3. Run the setup verification commands to check my BigQuery access

My tenant ID is: [your-tenant-id]
My project is: sm-[your-tenant-id]

Install

npx skills add source-medium/skills --skill sm-bigquery-analyst
Or copy the SKILL.md content below into your coding agent’s skills folder.

What this skill does

  1. Setup verification — validates gcloud/bq CLI, authentication, and table access
  2. Safe queries — SELECT-only, dry-run first, cost-guarded
  3. SQL receipts — every answer includes copy/paste SQL + verification command
  4. No fabrication — if access fails, returns exact error and access request template

Example Questions

After installing, ask your coding agent:
- "What was my revenue by channel last month?"
- "Show me new customer acquisition by source over the past 30 days"
- "What's my customer LTV by acquisition cohort? Use sm_order_line_type = 'all_orders'"
Your agent will verify access, generate SQL, and return an auditable receipt.

SKILL.md

Copy everything below into .claude/skills/sm-bigquery-analyst/SKILL.md (or equivalent for your agent):
---
name: sm-bigquery-analyst
description: Query SourceMedium-hosted BigQuery safely. Emits SQL receipts. SELECT-only, cost-guarded. Use when users need help with BigQuery setup, access verification, or analytical questions against SourceMedium datasets.
compatibility: Requires gcloud CLI, bq CLI, and network access to BigQuery.
metadata:
  author: sourcemedium
  version: "1.0"
---

# SourceMedium BigQuery Analyst

Use this skill to help end users work with SourceMedium BigQuery data from setup to analysis.

## Workflow

1. Verify environment (run the setup verification commands before any analysis)
2. Confirm project and dataset/table visibility
3. Use docs-first guidance for definitions and table discovery
4. Answer analytical questions with reproducible SQL receipts
5. Call out assumptions and caveats explicitly

## Setup Verification

Run these commands in order before writing analysis SQL:

~~~bash
# 1. Check CLI tools are installed
gcloud --version && bq version

# 2. Check authenticated account
gcloud auth list

# 3. Check active project
gcloud config get-value project

# 4. Validate BigQuery API access (dry-run)
bq query --use_legacy_sql=false --dry_run 'SELECT 1 AS ok'

# 5. Test table access (your project is named sm-<tenant_id>)
#    Example: if your tenant is "acme-corp", your project is sm-acme-corp
bq query --use_legacy_sql=false --dry_run "
  SELECT 1 
  FROM \`sm-<tenant_id>.sm_transformed_v2.obt_orders\` 
  LIMIT 1
"

# 6. Confirm you can actually read data (not just dry-run)
bq query --use_legacy_sql=false "
  SELECT 1
  FROM \`sm-<tenant_id>.sm_transformed_v2.obt_orders\`
  WHERE is_order_sm_valid = TRUE
  LIMIT 1
"
~~~

If any step fails, direct the user to request access from their internal admin.

## Safety Rules

These are hard constraints. Do not bypass.

### Query Safety

1. **SELECT-only** — deny: INSERT, UPDATE, DELETE, MERGE, CREATE, DROP, EXPORT, COPY
2. **Dry-run first** when iterating on new queries:
   ~~~bash
   bq query --use_legacy_sql=false --dry_run '<SQL>'
   ~~~
3. **Enforce cost limit** with maximum bytes billed:
   ~~~bash
   bq query --use_legacy_sql=false --maximum_bytes_billed=1073741824 '<SQL>'
   ~~~
   (1 GB = 1073741824 bytes. If the query fails the bytes-billed check, tighten filters or ask the user for approval before raising the limit.)
4. **Always bound queries**:
   - Add `LIMIT` clause (max 100 rows for exploratory)
   - Use date/partition filters when querying partitioned tables
   - Prefer `WHERE` filters on partition columns

### Data Safety

1. **Default to aggregates** — avoid outputting raw rows unless explicitly requested (see the sketch after this list)
2. **PII handling**:
   - Do not output columns likely containing PII (email, phone, address, name) without explicit confirmation
   - If PII is requested, confirm scope and purpose before proceeding
   - Prefer anonymization. Example:
     ~~~sql
     -- Hash PII instead of exposing raw values
     SELECT TO_HEX(SHA256(LOWER(email))) AS email_hash, ...
     ~~~
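
As a concrete illustration of the aggregate-first default, here is a minimal sketch that reports monthly counts and totals instead of raw rows. It reuses only columns already referenced elsewhere in this skill (`sm_order_key`, `sm_customer_key`, `order_net_revenue`, `is_order_sm_valid`, `order_processed_at_local_datetime`); adjust names if your schema differs.

~~~sql
-- Aggregate-first sketch: monthly order counts and revenue, no raw rows
SELECT
  DATE_TRUNC(DATE(order_processed_at_local_datetime), MONTH) AS order_month,
  COUNT(sm_order_key) AS order_count,
  COUNT(DISTINCT sm_customer_key) AS customers,
  SUM(order_net_revenue) AS revenue
FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`
WHERE is_order_sm_valid = TRUE
  AND DATE(order_processed_at_local_datetime) >= DATE_SUB(CURRENT_DATE(), INTERVAL 180 DAY)
GROUP BY 1
ORDER BY 1 DESC
~~~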

### Cost Guardrails

~~~sql
-- Good: bounded scan
SELECT ... FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`
WHERE DATE(order_processed_at_local_datetime) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
LIMIT 100

-- Bad: full table scan
SELECT ... FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`  -- no filters
~~~

## Output Contract

For analytical questions, always return:

1. **Answer** — concise plain-English conclusion
2. **SQL (copy/paste)** — BigQuery Standard SQL used for the result
3. **Notes** — timeframe, metric definitions, grain, scope, timezone, attribution lens
4. **Verify** — the `bq query --use_legacy_sql=false --dry_run '<SQL>'` command for the returned SQL
5. **Bytes scanned** — if >1GB, note this and ask for approval before running

If access/setup fails, do not fabricate results. Return:

1. Exact failing step
2. Exact project/dataset that failed
3. Direct user to the BigQuery access request template

## Query Guardrails

1. Fully qualify tables as `` `sm-<tenant_id>.dataset.table` ``
2. For order analyses, default to `WHERE is_order_sm_valid = TRUE`
3. Use `sm_store_id` (not `smcid` — that name does not exist in customer tables)
4. Use `SAFE_DIVIDE` for ratio math
5. Handle DATE/TIMESTAMP typing explicitly (`DATE(ts_col)` when comparing to dates)
6. Use `order_net_revenue` for revenue metrics (not `order_gross_revenue` unless explicitly asked)
7. **Prefer `*_local_datetime` columns** when available for date-based reporting; otherwise be explicit about UTC vs local
8. **For enumerations** (channel, platform, status), discover values with `SELECT DISTINCT` first, then use exact match (see the sketch after this list). Reserve `LIKE`/`REGEXP` for free-text fields (`utm_campaign`, `product_title`, `page_path`)
9. **LTV tables (`rpt_cohort_ltv_*`)**: always filter `sm_order_line_type` to exactly ONE value
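
For guardrail 8, a minimal discovery sketch: list the distinct values of an enumerated column (with counts) before writing exact-match filters. `sm_channel` is used here because it appears in this skill's examples; swap in whichever enumeration you need.

~~~sql
-- Discover enumeration values before writing exact-match filters
SELECT sm_channel, COUNT(*) AS order_count
FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`
WHERE is_order_sm_valid = TRUE
GROUP BY 1
ORDER BY order_count DESC
~~~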

## Key Tables

| Table | Grain | Use case |
|-------|-------|----------|
| `obt_orders` | 1 row per order | Revenue, profitability, channel analysis |
| `obt_order_lines` | 1 row per line item | Product performance, margins, COGS |
| `obt_customers` | 1 row per customer | Acquisition, retention, subscription status |
| `rpt_ad_performance_daily` | 1 row per channel/date | Ad spend, impressions, clicks |
| `rpt_cohort_ltv_*` | 1 row per cohort x month | LTV analysis (filter sm_order_line_type!) |
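
To see exactly which columns a table exposes before writing analysis SQL, you can query BigQuery's standard `INFORMATION_SCHEMA.COLUMNS` view. The sketch below assumes the `sm_transformed_v2` dataset named above; point it at another dataset or table as needed.

~~~sql
-- List columns and types for obt_orders (standard BigQuery metadata view)
SELECT column_name, data_type
FROM `sm-<tenant_id>.sm_transformed_v2.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'obt_orders'
ORDER BY ordinal_position
~~~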

## Project & Filtering

**Project naming:** Customer projects are named `sm-<tenant_id>`. Example: `sm-acme-corp`.

**Multi-store filtering:** If you have multiple stores, filter by `sm_store_id` to analyze a single store. Without this filter, all stores' data is combined.
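
A hedged sketch of single-store analysis: first discover which store IDs exist, then pin one in the revenue query (run the two statements separately). Column names follow the ones used elsewhere in this skill; the placeholder store ID and its type are assumptions to replace with a real value.

~~~sql
-- Step 1: discover which store IDs are present
SELECT DISTINCT sm_store_id
FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`;

-- Step 2: restrict analysis to a single store (replace the placeholder)
SELECT
  DATE(order_processed_at_local_datetime) AS order_date,
  SUM(order_net_revenue) AS revenue
FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`
WHERE is_order_sm_valid = TRUE
  AND sm_store_id = '<your-store-id>'
  AND DATE(order_processed_at_local_datetime) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 1 DESC
~~~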

**Sales channels:** Always filter or group by `sm_channel` for segmentation (see the sketch after this list):
- `online_dtc` — Direct-to-consumer website
- `amazon` — Amazon marketplace
- `tiktok_shop` — TikTok Shop
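
Building on the channel values above, a minimal sketch of channel revenue share over the last 30 days; it reuses only columns already shown in this skill and uses a window total for the share calculation.

~~~sql
-- Revenue share by sales channel, last 30 days
SELECT
  sm_channel,
  SUM(order_net_revenue) AS revenue,
  SAFE_DIVIDE(SUM(order_net_revenue), SUM(SUM(order_net_revenue)) OVER ()) AS revenue_share
FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`
WHERE is_order_sm_valid = TRUE
  AND DATE(order_processed_at_local_datetime) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY revenue DESC
~~~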

## Example Queries

### Daily revenue by channel

~~~sql
SELECT
  DATE(order_processed_at_local_datetime) AS order_date,
  sm_channel,
  COUNT(sm_order_key) AS order_count,
  SUM(order_net_revenue) AS revenue
FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`
WHERE is_order_sm_valid = TRUE
  AND DATE(order_processed_at_local_datetime) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY 1 DESC
~~~

### New customer acquisition

~~~sql
SELECT
  DATE(order_processed_at_local_datetime) AS order_date,
  sm_utm_source_medium,
  COUNT(DISTINCT sm_customer_key) AS new_customers,
  SUM(order_net_revenue) AS revenue,
  SAFE_DIVIDE(SUM(order_net_revenue), COUNT(DISTINCT sm_customer_key)) AS avg_first_order_value
FROM `sm-<tenant_id>.sm_transformed_v2.obt_orders`
WHERE is_order_sm_valid = TRUE
  AND order_sequence = '1st_order'
GROUP BY 1, 2
ORDER BY 1 DESC
~~~

### LTV cohort (CRITICAL: filter sm_order_line_type)

~~~sql
SELECT
  cohort_month,
  months_since_first_order,
  AVG(SAFE_DIVIDE(cumulative_order_net_revenue, cohort_size)) AS avg_ltv
FROM `sm-<tenant_id>.sm_transformed_v2.rpt_cohort_ltv_by_first_valid_purchase_attribute_no_product_filters`
WHERE sm_order_line_type = 'all_orders'
  AND acquisition_order_filter_dimension = 'source/medium'
  AND months_since_first_order <= 12
GROUP BY 1, 2
ORDER BY 1, 2
~~~

### Discover available metrics

Query `sm_metadata.dim_semantic_metric_catalog` to find 180+ pre-defined metrics:

~~~sql
-- Find all revenue metrics
SELECT metric_name, metric_label, calculation
FROM `sm-<tenant_id>.sm_metadata.dim_semantic_metric_catalog`
WHERE metric_category = 'revenue'
ORDER BY metric_name

-- Resolve abbreviated names (aov, mer, cac, roas)
SELECT metric_name, preferred_metric_name, metric_description
FROM `sm-<tenant_id>.sm_metadata.dim_semantic_metric_catalog`
WHERE metric_name IN ('aov', 'mer', 'cac', 'roas')
~~~

No access yet?

If you cannot run queries due to permissions, see the BigQuery Access Request Template.