Methods and Data Documentation
0) Scope of the Atlas
- Intended purpose: cross-source adverse event signal exploration and evidence contextualization.
- Not intended for pharmacovigilance regulatory decision support as a standalone system.
- Not intended to estimate population incidence or real-world risk.
- Not intended to provide clinical guidance, diagnosis, treatment, or prescribing advice.
- Population limitations apply because data sources differ in inclusion criteria, reporting behavior, and completeness.
The Atlas is intended for hypothesis generation and exploratory analysis rather than confirmatory evidence.
1) Data sources
This section documents ingestion and interpretation for FAERS, clinical trials, and PubMed.
1.1 FAERS data
What FAERS is
FAERS is a spontaneous reporting system and is used for signal detection rather than causal incidence estimation.
Input files used
All FAERS data files are downloaded directly from the FDA's website.
https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html
FAERS quarterly data files from Q1 2016 through Q4 2025 were included.
Case versioning logic
For each caseid, the latest primaryid version is retained.
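A minimal sketch of the versioning rule, assuming each report row carries `caseid` and `primaryid` and that a numerically larger `primaryid` marks a later version of the same case. The function name and the demo rows are illustrative and are not taken from the Atlas pipeline.

```python
def latest_versions(rows):
    """Keep, for each caseid, the row with the highest primaryid."""
    latest = {}
    for row in rows:
        caseid = row["caseid"]
        # Assumption: a numerically larger primaryid is a later case version.
        if caseid not in latest or int(row["primaryid"]) > int(latest[caseid]["primaryid"]):
            latest[caseid] = row
    return list(latest.values())


# Illustrative rows only (not real FAERS identifiers).
demo_rows = [
    {"caseid": "100", "primaryid": "1001"},
    {"caseid": "100", "primaryid": "1002"},
    {"caseid": "200", "primaryid": "2001"},
]
kept = latest_versions(demo_rows)
```

Here case 100 keeps only its second version, and case 200 is kept unchanged.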
Substance parsing
- Source: reported active ingredient (prod_ai)
- Normalization: lowercase, trim whitespace, remove blanks
Terminology: In AEAtlas, Substance name (active ingredient) is used consistently for this field.
prod_ai is a reporter-entered field and may include spelling variants, brand names, combinations, salt forms, or incomplete ingredient names.
Adverse Event parsing
- Source: Preferred term (pt)
- Normalization: lowercase, trim whitespace, remove blanks
Counting definition
Counts represent the number of unique case identifiers (caseid) in which a given substance-event pair was reported, after retaining the most recent version of each case.
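The counting definition can be sketched as follows. This assumes drug and reaction rows are keyed by `caseid` and already reduced to the latest case version; the helper names `normalize` and `count_pairs` and the demo rows are hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and trim; return None for blank values."""
    cleaned = (text or "").strip().lower()
    return cleaned or None


def count_pairs(drug_rows, reaction_rows):
    """Count distinct caseids per (substance, event) pair."""
    drugs_by_case = defaultdict(set)
    for row in drug_rows:
        substance = normalize(row["prod_ai"])
        if substance:
            drugs_by_case[row["caseid"]].add(substance)

    cases_by_pair = defaultdict(set)
    for row in reaction_rows:
        event = normalize(row["pt"])
        if event is None:
            continue
        for substance in drugs_by_case.get(row["caseid"], ()):
            # A set of caseids means repeated rows within a case count once.
            cases_by_pair[(substance, event)].add(row["caseid"])
    return {pair: len(cases) for pair, cases in cases_by_pair.items()}


# Illustrative rows only.
drug_rows = [
    {"caseid": "1", "prod_ai": " Aspirin "},
    {"caseid": "1", "prod_ai": "ASPIRIN"},
    {"caseid": "2", "prod_ai": "aspirin"},
    {"caseid": "3", "prod_ai": "   "},
]
reaction_rows = [
    {"caseid": "1", "pt": "Nausea"},
    {"caseid": "2", "pt": "nausea "},
    {"caseid": "3", "pt": "Nausea"},
]
pair_counts = count_pairs(drug_rows, reaction_rows)
```

Case 1 lists the same ingredient twice but contributes one count, and case 3 is dropped because its ingredient is blank.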
MedDRA System Organ Classes (SOC) mapping
- Multiple SOCs per PT are collapsed to a joined list
Known limitations
- Under-reporting and over-reporting
- Notoriety bias and duplicate complexity
- Missing age/sex and concomitant medication confounding
- Indication/channeling bias
- Exposure denominators (number of patients receiving a drug) are not available in FAERS, preventing incidence estimation
1.2 Clinical trials (ClinicalTrials.gov results)
Database
Postgres: clinicaltrials.gov
ClinicalTrials.gov results data are aggregated arm-level summaries reported by study sponsors and do not represent individual participant data.
Substance to trial mapping
Intervention matching uses a case-insensitive substring pattern (ILIKE %drug%). Limitations include substring collisions and missed brand-name variants unless the brand names are explicitly included as search terms.
Intervention matching is heuristic and may misclassify trials when intervention names include multiple drugs, brand names, or descriptive phrases.
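The two failure modes can be illustrated with a small emulation of the substring match. This is a sketch only: `ilike_contains` is a hypothetical helper that mimics `ILIKE '%drug%'` for a drug name without LIKE wildcard characters, and the intervention strings are made up for the example.

```python
def ilike_contains(intervention_name, drug):
    """Emulate `intervention_name ILIKE '%drug%'`: case-insensitive substring test."""
    return drug.lower() in intervention_name.lower()


# Substring collision: "estradiol" also matches a different substance.
collision = ilike_contains("Ethinylestradiol 30 mcg", "estradiol")

# Missed brand name: the ingredient name does not appear in the brand string.
missed = ilike_contains("Tylenol 500 mg", "paracetamol")
```

In this example `collision` is true even though the intervention is a different substance, and `missed` is false even though the brand contains the searched ingredient.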
Included trials
Only studies involving drugs listed as approved by the European Medicines Agency (EMA) are included. The EMA list is downloaded from https://www.ema.europa.eu/en/medicines/download-medicine-data.
Only studies where results_first_posted_date IS NOT NULL are included, meaning that only trials with posted results appear in the Atlas.
Group logic
- Reported event groups: EGxxx
- Baseline groups: BGxxx
- Map EG to BG by normalized group-title equality
Group mapping relies on title normalization heuristics and may fail for complex study designs (e.g., crossover, extension phases, pooled arms).
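A minimal sketch of title-based group mapping. The normalization rule shown here (lowercase, non-alphanumeric characters to spaces, collapsed whitespace) is an assumption, since the exact rule is not documented on this page, and the group titles are invented.

```python
import re


def normalize_title(title):
    """Hypothetical normalization: lowercase, punctuation to spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def map_event_to_baseline(event_groups, baseline_groups):
    """Map each EG id to the BG id with an equal normalized title, else None."""
    by_title = {normalize_title(title): bg for bg, title in baseline_groups.items()}
    return {eg: by_title.get(normalize_title(title)) for eg, title in event_groups.items()}


# Illustrative group titles only.
event_groups = {
    "EG000": "Drug A 10 mg",
    "EG001": "Placebo",
    "EG002": "Drug A 10 mg (Extension Phase)",
}
baseline_groups = {"BG000": "Drug A - 10 mg", "BG001": "placebo "}
mapping = map_event_to_baseline(event_groups, baseline_groups)
```

The extension-phase group has no baseline group with an equal title, so it stays unmapped, which is the kind of failure described above for complex designs.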
Treatment vs control selection
- Control arm: title contains placebo/control/comparator/SOC patterns and does not mention drug
- Treatment arm: mentions drug and is not control-pattern arm
- Multiple candidates: choose highest subjects_at_risk
Arm classification is rule-based and may not perfectly reflect the study's intended comparison structure, particularly in multi-arm or crossover designs.
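The selection rules above can be sketched as below. The regular expression is a hypothetical stand-in for the Atlas control patterns (reading SOC as standard of care is an assumption), and the arm titles are invented.

```python
import re

# Hypothetical control patterns; the actual pattern list is not given on this page.
CONTROL_PATTERN = re.compile(
    r"\b(placebo|control|comparator|soc|standard of care)\b", re.IGNORECASE
)


def classify_arm(title, drug):
    """Return 'control', 'treatment', or None for an arm title."""
    mentions_drug = drug.lower() in title.lower()
    looks_control = bool(CONTROL_PATTERN.search(title))
    if looks_control and not mentions_drug:
        return "control"
    if mentions_drug and not looks_control:
        return "treatment"
    return None  # ambiguous arm, e.g. drug plus placebo


def pick_arm(arms, drug, role):
    """Among arms with the given role, choose the one with the highest subjects_at_risk."""
    candidates = [arm for arm in arms if classify_arm(arm["title"], drug) == role]
    return max(candidates, key=lambda arm: arm["subjects_at_risk"], default=None)


arms = [
    {"title": "Drug A 5 mg", "subjects_at_risk": 100},
    {"title": "Drug A 10 mg", "subjects_at_risk": 120},
    {"title": "Placebo", "subjects_at_risk": 110},
    {"title": "Drug A + Placebo", "subjects_at_risk": 90},
]
treatment = pick_arm(arms, "drug a", "treatment")
control = pick_arm(arms, "drug a", "control")
```

The combination arm matches both rules and is therefore assigned to neither role in this sketch.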
Outcome extracted
Per trial AE: subjects_affected and subjects_at_risk.
Effect measure
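For each trial and adverse event term:
- a: affected participants in treatment arm
- n_treat: participants at risk in treatment arm
- c: affected participants in control arm
- n_control: participants at risk in control arm
The risk difference is the treatment-arm risk minus the control-arm risk:
$$RD = p_t - p_c, \qquad p_t = \frac{a}{n_{treat}}, \qquad p_c = \frac{c}{n_{control}}$$
Interpretation:
- RD > 0: higher AE risk in treatment arm
- RD < 0: lower AE risk in treatment arm
- RD = 0: equal observed risk
Approximate standard error and confidence interval (Wald form):
$$SE(RD) = \sqrt{\frac{p_t(1-p_t)}{n_{treat}} + \frac{p_c(1-p_c)}{n_{control}}}, \qquad 95\%\ CI = RD \pm 1.96 \cdot SE(RD)$$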
The UI currently displays the point estimate-oriented metrics and not trial-level RD confidence intervals.
Risk differences are unadjusted and do not account for baseline imbalances, stratification factors, or time-at-risk differences.
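As a worked sketch of the unadjusted risk difference with a Wald interval, using the a, n_treat, c, n_control inputs defined above. The function name is hypothetical, and returning None for an empty arm is an assumption rather than documented Atlas behavior.

```python
import math


def risk_difference(a, n_treat, c, n_control, z=1.96):
    """Return (RD, SE, (lower, upper)) using the Wald form, or None for an empty arm."""
    if n_treat <= 0 or n_control <= 0:
        return None  # assumption: no estimate without participants at risk
    p_treat = a / n_treat
    p_control = c / n_control
    rd = p_treat - p_control
    se = math.sqrt(
        p_treat * (1 - p_treat) / n_treat + p_control * (1 - p_control) / n_control
    )
    return rd, se, (rd - z * se, rd + z * se)


# Illustrative counts: 12/100 affected on treatment, 6/100 on control.
rd, se, (lower, upper) = risk_difference(12, 100, 6, 100)
```

With these counts the point estimate is 0.06 and the interval spans zero, so the observed difference alone would not distinguish a higher from an equal risk.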
Sex and age extraction
Source table: baseline_measurements
- Sex: sum female and male rows per BG group
- Age mean: param_type ~ mean/average
- Age SD: param_type ~ sd/stddev/standard deviation
These sex and age values describe baseline trial cohort demographics for selected arms and are not AE-specific participant subsets.
Limitations: median/IQR-only trials, missing SD, mislabelled param_type, multiple age-row semantics.
Known limitations
- Published-results subset only
- Selective reporting
- Heterogeneous AE coding and baseline reporting
- Differences in treatment duration and follow-up across trials are not adjusted for in the current metrics
1.3 PubMed (literature signal)
What is queried
Substance mentions in the title/abstract together with adverse-event terms, for publications from 2016 onward. Only EMA-approved drugs are currently searched. The list can be downloaded from https://www.ema.europa.eu/en/medicines/download-medicine-data.
Term extraction method
- Dictionary-based term matching derived from MedDRA Preferred Terms
- Abstract cleaning: lowercase, punctuation cleanup, header-line cleanup
- Match extraction from abstract blocks
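The extraction steps above can be sketched as follows, assuming dictionary terms are already lowercased and free of punctuation. The header-line rule and the function names are illustrative guesses, not the Atlas implementation.

```python
import re


def clean_abstract(text):
    """Drop header-only lines (e.g. 'RESULTS:'), lowercase, and replace punctuation with spaces."""
    lines = [
        line for line in text.splitlines()
        if not re.fullmatch(r"\s*[A-Za-z ]+:\s*", line)
    ]
    lowered = " ".join(lines).lower()
    return " ".join(re.sub(r"[^a-z0-9]+", " ", lowered).split())


def match_terms(text, dictionary):
    """Return dictionary terms that occur as whole-word phrases in the cleaned text."""
    cleaned = f" {clean_abstract(text)} "
    return sorted(term for term in dictionary if f" {term} " in cleaned)


abstract = (
    "RESULTS:\n"
    "Nausea and acute kidney injury were reported; no hepatotoxicity was seen."
)
terms = {"nausea", "acute kidney injury", "hepatotoxicity", "rash"}
matches = match_terms(abstract, terms)
```

In this example "hepatotoxicity" is matched even though the sentence negates it, which is the kind of contextual false positive that plain dictionary matching produces.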
Indication filtering
Terms matching therapeutic-area concepts are filtered to reduce indication-as-AE false positives.
Publication year
From entrez_summary() pubdate or epublishdate.
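A small sketch of year parsing from these two fields, assuming the summary record is available as a dictionary of strings. Preferring pubdate and falling back to epublishdate is an assumption about the order; the function name and date strings are illustrative.

```python
import re

YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")


def publication_year(summary):
    """Return the first four-digit year found in pubdate, else epublishdate, else None."""
    for field in ("pubdate", "epublishdate"):  # assumed fallback order
        match = YEAR_RE.search(summary.get(field) or "")
        if match:
            return int(match.group(0))
    return None


year_a = publication_year({"pubdate": "2021 Mar 15"})
year_b = publication_year({"pubdate": "", "epublishdate": "2019/11/02"})
year_c = publication_year({})
```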
Known limitations
- Not a curated AE dataset
- False positives from contextual mention
- Title/abstract only, limited negation handling
- Literature mentions do not necessarily represent observed adverse events in study populations and may include speculative or background discussion
- Published literature may preferentially report unusual or severe events, introducing reporting bias
- Generic terms such as “serious adverse event” are not PT diagnoses
2) Terminology and mapping
2.1 MedDRA mapping & Organ system
SOC mapping is retained even when non-unique. A PT can validly map to multiple SOCs and is stored as a joined list.
2.2 What counts as an adverse event term in the Atlas?
Serious adverse event classification
AEAtlas does not include or display seriousness classifications (e.g., serious vs. non-serious adverse events).
Seriousness is a regulatory designation based on patient outcomes (such as death, hospitalization, life-threatening events, or disability) rather than a specific clinical diagnosis. Because AEAtlas is designed to analyze and compare clinical adverse event terms across studies and data sources, seriousness classifications were not incorporated into the current data model.
Seriousness is distinct from severity; severity reflects clinical intensity, while seriousness reflects regulatory outcomes.
Additionally, seriousness definitions and reporting practices vary across data sources (clinical trials, spontaneous reporting systems, and publications), which limits comparability. Excluding seriousness avoids introducing inconsistent or non-comparable metrics into the Atlas.
Future versions may incorporate seriousness as an optional filter if harmonized definitions and reliable cross-source mappings become available.
Substance reaction
Free-text "substance reaction" entries are too nonspecific to serve as AE terms. They are included only when mapped to a specific PT concept.
| Policy | Definition | Handling |
|---|---|---|
| Include | Specific clinical concepts (PT-like) | Keep in Atlas AE nodes |
| Exclude | Seriousness tags (e.g., serious adverse event) | Not included or displayed in current model |
| Exclude/flag | Administrative terms (e.g., treatment emergent) | Exclude from AE nodes |
| Exclude/flag | Catch-alls (e.g., drug reaction) | Exclude unless standardized PT mapping exists |
| Conditional | Procedural terms | Include only when clinically meaningful |
3) Analytics and metrics
3.1 FAERS disproportionality
Disproportionality metrics reflect reporting patterns and should not be interpreted as incidence or relative risk in exposed populations.
Disproportionality metrics are influenced by co-reported drugs and confounding by indication.
2x2 table
Cells a, b, c, d count reports by substance and event, where D is the drug and E is the adverse event term:
| | Event E | Other events |
|---|---|---|
| Drug D | a | b |
| Other drugs | c | d |
ROR and CI
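No formula appears under this heading. The standard reporting odds ratio with a log-scale Wald interval, assumed here to match the form the Atlas uses, is:

```latex
\mathrm{ROR} = \frac{a/b}{c/d} = \frac{ad}{bc}, \qquad
SE(\ln \mathrm{ROR}) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}, \qquad
95\%\ \mathrm{CI} = \exp\left(\ln \mathrm{ROR} \pm 1.96 \cdot SE(\ln \mathrm{ROR})\right)
```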
PRR and CI
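No formula appears under this heading. The standard proportional reporting ratio with a log-scale Wald interval, assumed here to match the form the Atlas uses, is:

```latex
\mathrm{PRR} = \frac{a/(a+b)}{c/(c+d)}, \qquad
SE(\ln \mathrm{PRR}) = \sqrt{\frac{1}{a} - \frac{1}{a+b} + \frac{1}{c} - \frac{1}{c+d}}, \qquad
95\%\ \mathrm{CI} = \exp\left(\ln \mathrm{PRR} \pm 1.96 \cdot SE(\ln \mathrm{PRR})\right)
```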
IC and IC025
Let N=a+b+c+d. Smoothed IC approximation in the Atlas:
Approximate lower 95% bound displayed as IC025:
The implemented SE(IC) is derived on log scale and converted to base-2 scale.
Edge cases
When required cells make computations unstable (for example zero terms in critical denominators), metrics are suppressed and the UI shows an insufficient-data status.
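A sketch of how suppression might interact with the ratio metrics, using the textbook ROR and PRR definitions on the a, b, c, d cells. The rule used here (suppress when any cell is zero) is an assumption; the exact Atlas conditions are not listed on this page. IC is left out because its smoothing constants are not stated.

```python
import math


def disproportionality(a, b, c, d, z=1.96):
    """Return ROR and PRR with log-scale Wald intervals, or None when suppressed."""
    if min(a, b, c, d) <= 0:
        return None  # assumption: any empty cell makes the metrics unstable

    def interval(estimate, se_log):
        return (estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log))

    ror = (a * d) / (b * c)
    se_log_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    prr = (a / (a + b)) / (c / (c + d))
    se_log_prr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return {
        "ror": ror,
        "ror_ci": interval(ror, se_log_ror),
        "prr": prr,
        "prr_ci": interval(prr, se_log_prr),
    }


# Illustrative counts only.
signal = disproportionality(20, 80, 100, 9800)
suppressed = disproportionality(0, 5, 3, 10)
```

A None result corresponds to the insufficient-data status in the UI.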
3.2 Clinical trials summary metrics
The Top-5 card is sorted by percent_affected. The Substance + AE risk card reports raw weighted incidence and supporting counts from the aggregated trial table.
Study-level inputs for meta-analysis
Meta-analysis is run at trial level (one row per drug_name + adverse_event_term + nct_id) using:
- a: affected in treatment arm
- n_treat: at-risk in treatment arm
- c: affected in control arm
- n_control: at-risk in control arm
Per-study effects
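Risk definitions:
$$p_t = \frac{a}{n_{treat}}, \qquad p_c = \frac{c}{n_{control}}$$
Risk difference (RD):
$$RD = p_t - p_c, \qquad \operatorname{Var}(RD) = \frac{p_t(1-p_t)}{n_{treat}} + \frac{p_c(1-p_c)}{n_{control}}$$
Risk ratio (RR) on log scale (with continuity correction when needed for zero-event cells):
$$\log RR = \log\frac{p_t}{p_c}, \qquad \operatorname{Var}(\log RR) = \frac{1}{a} - \frac{1}{n_{treat}} + \frac{1}{c} - \frac{1}{n_{control}}$$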
Zero-event continuity correction is applied to RR components when required to keep logarithms and variances computable.
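A sketch of a per-study log risk ratio with a zero-event correction. The correction used here (add 0.5 to events and non-events in both arms when either arm has zero events) is a common convention and is an assumption; the Atlas constant and trigger rule are not stated on this page.

```python
import math


def log_risk_ratio(a, n_treat, c, n_control, correction=0.5):
    """Return (log RR, variance), correcting both arms when either has zero events."""
    a_adj, n_treat_adj, c_adj, n_control_adj = a, n_treat, c, n_control
    if a == 0 or c == 0:
        # Adding `correction` to events and non-events raises each arm total by 2 * correction.
        a_adj, c_adj = a + correction, c + correction
        n_treat_adj = n_treat + 2 * correction
        n_control_adj = n_control + 2 * correction
    log_rr = math.log((a_adj / n_treat_adj) / (c_adj / n_control_adj))
    variance = 1 / a_adj - 1 / n_treat_adj + 1 / c_adj - 1 / n_control_adj
    return log_rr, variance


# Illustrative counts: one study without and one with a zero-event arm.
plain_log_rr, plain_var = log_risk_ratio(10, 100, 5, 100)
zero_log_rr, zero_var = log_risk_ratio(0, 50, 4, 50)
```

Without the correction, the second study would require the logarithm of zero and a division by zero in the variance.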
DerSimonian-Laird random effects
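Fixed-effect weights, where $y_i$ is the effect of study $i$ ($\log RR_i$ or $RD_i$) and $v_i$ is its variance:
$$w_i = \frac{1}{v_i}$$
Fixed-effect pooled estimate:
$$\hat{\theta}_{FE} = \frac{\sum_i w_i y_i}{\sum_i w_i}$$
Cochran's Q and DL between-study variance, for $k$ studies:
$$Q = \sum_i w_i (y_i - \hat{\theta}_{FE})^2, \qquad \hat{\tau}^2 = \max\left(0,\ \frac{Q - (k - 1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}\right)$$
Random-effects weights and pooled estimate:
$$w_i^* = \frac{1}{v_i + \hat{\tau}^2}, \qquad \hat{\theta}_{DL} = \frac{\sum_i w_i^* y_i}{\sum_i w_i^*}, \qquad SE(\hat{\theta}_{DL}) = \sqrt{\frac{1}{\sum_i w_i^*}}$$
For RR display, the pooled log RR and its interval bounds are exponentiated:
$$RR_{DL} = \exp(\hat{\theta}_{DL}), \qquad 95\%\ CI = \exp\left(\hat{\theta}_{DL} \pm 1.96 \cdot SE(\hat{\theta}_{DL})\right)$$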
For RD display, the same DL framework is applied directly to $RD_i$ and $\operatorname{Var}(RD_i)$.
The DerSimonian-Laird estimator assumes approximately normally distributed study effects and may underestimate uncertainty when the number of studies is small.
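A compact sketch of DerSimonian-Laird pooling over per-study effects and variances, written from the standard definitions rather than from the Atlas code. For RR, the inputs would be log RR values and the pooled estimate and interval bounds would be exponentiated afterwards. The function name and the example inputs are illustrative.

```python
import math


def dersimonian_laird(effects, variances, z=1.96):
    """Pool study effects with DL random effects; also return Q, tau^2 and I^2."""
    k = len(effects)
    weights = [1 / v for v in variances]
    total = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = k - 1
    scale = total - sum(w ** 2 for w in weights) / total
    tau2 = max(0.0, (q - df) / scale) if scale > 0 else 0.0
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci": (pooled - z * se, pooled + z * se),
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


# Illustrative inputs: three studies with equal variance and spread-out effects.
result = dersimonian_laird([0.0, 0.4, 0.8], [0.04, 0.04, 0.04])
```

With a single study the between-study variance is set to zero, so the pooled estimate equals that study's effect.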
Heterogeneity
Substantial heterogeneity may indicate differences in study populations, design, or reporting practices rather than true variation in drug effects.
The card reports:
- Meta RR (DL) with 95% CI
- Meta RD (DL) with 95% CI
- $I^2$ and $\tau^2$
- Model label: DerSimonian-Laird random effects
3.3 PubMed summary metrics
- publication_count: unique PMID count per substance + AE
- Recent studies: top 3 unique PMIDs by pub_year DESC
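The two summary fields can be sketched as below, assuming one input row per matched term occurrence with `pmid` and `pub_year` keys. Tie-breaking among equal years and the placement of missing years are not specified on this page, so the choices here are assumptions.

```python
def summarize_publications(rows, top_n=3):
    """Return the unique PMID count and the most recent PMIDs by publication year."""
    by_pmid = {}
    for row in rows:
        by_pmid.setdefault(row["pmid"], row)  # keep the first row seen per PMID
    recent = sorted(
        by_pmid.values(),
        # Assumption: records without a year sort after all dated records.
        key=lambda r: (r["pub_year"] is not None, r["pub_year"] or 0),
        reverse=True,
    )[:top_n]
    return {
        "publication_count": len(by_pmid),
        "recent_pmids": [r["pmid"] for r in recent],
    }


# Illustrative rows only (placeholder PMIDs).
rows = [
    {"pmid": "1", "pub_year": 2018},
    {"pmid": "2", "pub_year": 2022},
    {"pmid": "2", "pub_year": 2022},
    {"pmid": "3", "pub_year": 2020},
    {"pmid": "4", "pub_year": 2024},
]
summary = summarize_publications(rows)
```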
4) Application Logic and Display Assumptions
4.1 Search behavior
- Substance list from dedicated source table
- AE suggestions merged from FAERS, clinical trials, PubMed
- Mix of eq and ilike behavior across endpoints
4.2 Pagination and caching
- PAGE_SIZE and DETAILS_PAGE_SIZE drive payload and table pagination
- FAERS signal uses cached RPC totals (faersSignalTotalsCache)
- PubMed and trial modals use local response caching
4.3 Charts
- FAERS charts read pre-aggregated by-year/sex/age structures
- Trials sex chart sums treatment + control participant counts across included rows (not unique persons)
- Trials age chart uses mean+SD normal approximation and tracks unknown age separately
Important interpretation note: Clinical trials sex and age charts represent demographics of the included trial cohorts selected by the current substance + AE query context. They are not AE-specific participant distributions (that is, not restricted to only participants who experienced the selected adverse event).
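The mean+SD normal approximation for the trials age chart can be sketched as follows. The bin edges, labels, and function names are hypothetical, since the chart's actual bins are not listed on this page; arms without a usable mean or SD are counted as unknown.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def approximate_age_bins(mean, sd, n, edges=(18, 45, 65)):
    """Spread n participants over age bins using a normal approximation."""
    if mean is None or sd is None or sd <= 0:
        return {"unknown": n}
    cuts = [normal_cdf(edge, mean, sd) for edge in edges]
    probabilities = (
        [cuts[0]]
        + [upper - lower for lower, upper in zip(cuts, cuts[1:])]
        + [1 - cuts[-1]]
    )
    labels = ["<18", "18-44", "45-64", "65+"]  # hypothetical bins
    return {label: n * p for label, p in zip(labels, probabilities)}


# Illustrative arm: 200 participants, mean age 55, SD 10.
bins = approximate_age_bins(55, 10, 200)
missing = approximate_age_bins(None, None, 150)
```

The resulting counts are fractional estimates of the cohort age distribution, not observed participant counts.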
5) Data refresh & reproducibility
5.1 Pipeline overview
- FAERS ETL: ingest, dedup/versioning, aggregate outputs with date-stamped artifacts
- Trials ETL: extraction, arm mapping logic, effect fields and demographics
- PubMed ETL: query and extraction logic, year parsing, output artifacts
- Supabase load: copy commands, NA/null handling, downstream refresh steps
- Recommended build stamp in UI footer/wiki home for traceability
6) Limitations, bias, and appropriate use
- FAERS reporting bias
- PubMed publication bias
- Clinical trial selective reporting
- Confounding by indication/channeling
- Duplicate counting pitfalls across sources
- No time-to-event modeling in current metrics
- No dose-response modeling in current metrics
- No unified causal inference framework across sources
- No patient-level covariate adjustment in displayed summaries
- Metrics across FAERS, clinical trials, and literature are not directly comparable due to differences in data generation processes
Atlas metrics are for signal detection and prioritization, not standalone causal inference.
7) FAQ
Why does FAERS show more events than trials?
Different source design and capture behavior: spontaneous reporting vs structured study outputs.
Why do some terms have multiple organ systems?
A single PT can map to multiple SOCs; Atlas preserves that mapping.
Why is my AE missing from PubMed/trials?
Term mismatch, coverage limitations, or filter behavior can exclude specific terms.
Why do sex/age charts show unknowns?
Unknown appears when denominators exist but demographic fields are absent or non-standard.
Why are ROR/PRR/IC sometimes blank?
Metrics are suppressed when counts do not support stable computation.
