Datasets¶
At Vulners we continuously collect and normalize vulnerability intelligence from 220+ upstream sources—vendor advisories, NVD/CVE, Linux distro feeds, package registries, exploit/PoC trackers, and government/ISAC alerts—into a single graph‑linked corpus. Every record is deduplicated, cross‑referenced (CVE↔advisory↔patch↔exploit), time‑stamped, and normalized to standard fields (e.g., CVSS vectors, CWE, affected products) in consistent JSON. Mirror it via the Archive API and drop it straight into your warehouse or pipelines: you skip scraping, parser maintenance, and feed orchestration and go directly to enrichment, analytics, and features.
Turn Vulners into your product’s vulnerability data backbone—without building a data pipeline from scratch. This page gives you both the why and the how so a founding engineer can stand up a reliable mirror, plug it into a warehouse, and start shipping features the same day.
Why this matters¶
- Unified, graph-linked foundation. Work on a single, normalized corpus that connects CVEs, vendor advisories, exploits, patches, and real-world observations. No feed stitching, no custom parsers—just clean JSON ready for on-prem ingestion or lake/warehouse pipelines. This is the substrate for analytics, correlation, and product features.
- Always in sync via Archive API. Start from a stable full snapshot, then apply compact incrementals keyed by timestamps.updated. Full archives are refreshed ~every 4 hours; incremental windows cover the last 25 hours. That design gives you reproducible jobs, auditability, and simple recovery if a run slips.
- Built for research & ML velocity. Mine relationships (advisory↔exploit↔asset), trend patch latency, model exploit likelihood, and feed risk scoring directly from consistent, quality-checked JSON spanning hundreds of sources.
Outcome: faster analytics, clearer relationships, and dramatically less ops work—so you can focus engineering on UX, detection logic, and product differentiation.
What you get (at a glance)¶
- Collections: precompiled archives (e.g., debiancve, vendor advisories, exploit sources, etc.).
- Transport: v4/archive/collection (full) + v4/archive/collection-update (incremental) for mirror-friendly sync. Full is CDN-backed; incrementals are DB-backed and sorted newest→oldest by timestamps.updated.
- Contract: stable schemas, explicit update timestamps, reproducible snapshots, simple upsert keys (id).
- Elasticsearch/OpenSearch-ready: clean, predictable JSON with stable id keys and timestamps.updated for incremental upserts; drop straight into time-based indices and aggregations.
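If Elasticsearch/OpenSearch is your destination, the upsert pattern is only a few lines. A minimal sketch, assuming the official elasticsearch Python client; the cluster URL and index name (vulners-debiancve) are illustrative:

# Minimal sketch: bulk-upsert mirrored records into Elasticsearch/OpenSearch.
# Cluster URL and index name are illustrative; only the stable id and record body are assumed.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster
INDEX = "vulners-debiancve"                  # illustrative index name

def upsert_records(records):
    actions = (
        {
            "_op_type": "index",   # index = create-or-replace, keyed by _id
            "_index": INDEX,
            "_id": rec["id"],      # stable Vulners id makes re-runs idempotent
            "_source": rec,
        }
        for rec in records
    )
    helpers.bulk(es, actions)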
Architecture (typical)¶
Vulners Archive API
├─ Full snapshot (CDN, ~4h refresh)
└─ Incrementals (≤25h window, sorted by timestamps.updated)
↓
Your Mirror (S3/GCS + object manifest OR Postgres/ClickHouse)
↓
Warehouse/Lake (BigQuery, Snowflake, DuckDB, Parquet)
↓
Product Features (enrichment, dashboards, alerts, ML)
Resilience: If a job slips past 25h, re-baseline from the full snapshot and resume incrementals. Keep a checkpoint of the newest timestamps.updated you’ve processed.
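That recovery rule is easy to encode in your scheduler. A sketch, assuming the checkpoint is stored as an ISO-8601 string like the values in timestamps.updated:

# Sketch: choose between a fresh full baseline and an incremental run,
# based on the age of the stored checkpoint (the incremental window is 25 hours).
from datetime import datetime, timezone

WINDOW_HOURS = 25

def needs_rebaseline(checkpoint_iso: str) -> bool:
    checkpoint = datetime.fromisoformat(checkpoint_iso.replace("Z", "+00:00"))
    age_hours = (datetime.now(timezone.utc) - checkpoint).total_seconds() / 3600
    return age_hours >= WINDOW_HOURS  # past the window: re-baseline from the full snapshot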
Quickstart: Mirror a collection (copy-paste)¶
Before you start: generate an API key and export it as VULNERS_API_KEY. See Authentication docs.
The snippet below baselines a collection and then applies an incremental update using your last checkpoint. Swap COLLECTION for anything you need and wire the items into your own storage/upsert logic.
- Full collection is served from a precompiled archive and refreshed ~every 4h.
- Update queries return records newer than a given timestamp, limited to the last 25h, sorted by timestamps.updated (desc).
- Store the latest seen timestamps.updated as your checkpoint.
- Full documentation on API calls: fetch_collection() and fetch_collection_update().
import os
import time

import vulners

# Read the key from the environment (see Authentication docs)
API_KEY = os.environ["VULNERS_API_KEY"]
COLLECTION = "debiancve"  # change to any collection you need

v = vulners.VulnersApi(api_key=API_KEY)


def print_preview(items):
    """
    Print a compact, human-friendly preview that matches the sample output format:
      - First 4 items, then an ellipsis line, then the last item (if enough items)
      - Show only id and a flattened 'timestamps.updated'
    """
    def flat(rec):
        ts = (rec.get("timestamps") or {}).get("updated")
        return {"id": rec.get("id"), "timestamps.updated": ts}

    n = len(items)
    head = [flat(r) for r in items[:4]]
    tail = [flat(items[-1])] if n >= 5 else []

    print("[")
    for r in head:
        print(f"  {r},")
    if n >= 5:
        print("  ...,")
        print(f"  {tail[0]}")
    print("]")


# 1) Full collection snapshot (CDN-backed; refreshed ~every 4 hours)
t0 = time.time()
full = v.archive.fetch_collection(type=COLLECTION)
print(f"Number of items: {len(full)}")
print(f"Request processing time: {time.time() - t0:.2f} seconds")
print("\nExample response structure:")
print_preview(full)

# Use the most recent update timestamp as a checkpoint
latest_update = max(
    (rec.get("timestamps") or {}).get("updated")
    for rec in full
    if (rec.get("timestamps") or {}).get("updated")
)
print(f"Using 'latest_update': {latest_update}")

# 2) Incremental update since the checkpoint (DB-backed; 25h window)
t1 = time.time()
delta = v.archive.fetch_collection_update(type=COLLECTION, after=latest_update)
print(f"Number of items: {len(delta)}")
print(f"Request processing time: {time.time() - t1:.2f} seconds")
print("\nExample response structure:")
print_preview(delta)
How to realize value (implementation patterns)¶
Pick one of these starting points; all hinge on the same mirror pattern (full + incrementals + checkpoint + upsert).
1) Enrich your detection pipeline¶
- Goal: add context (patches, KEV/exploit ties, vendor advisories) to findings in your SIEM/XDR.
- How: upsert Vulners records by id; index cve, products, references, and timestamps.updated. Join on CVE or product/version keys (see the sketch below).
- Ship: “Why it matters” panels, fix links, exploit presence flags.
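A join sketch for the pattern above, assuming the mirror is loaded in memory and that each record exposes a CVE list; field names such as cvelist and type are assumptions that vary by collection:

# Sketch: enrich SIEM/XDR findings with mirrored Vulners context, joined on CVE.
# The cvelist/cve and type field names are assumptions; adapt the accessors to your collection.
from collections import defaultdict

def build_cve_index(records):
    index = defaultdict(list)
    for rec in records:
        for cve in rec.get("cvelist") or rec.get("cve") or []:
            index[cve].append(rec)
    return index

def enrich_finding(finding, cve_index):
    matches = cve_index.get(finding["cve"], [])
    finding["vulners_ids"] = [m["id"] for m in matches]                          # advisory/exploit linkage
    finding["has_exploit"] = any("exploit" in m.get("type", "") for m in matches)  # type field is an assumption
    return finding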
2) Asset & SBOM impact mapping¶
- Goal: turn CMDB/SBOM into live risk lists.
- How: map vendor/product/version tuples to affected entries; store the joins; refresh nightly with incrementals (see the sketch below).
- Ship: tenant-scoped “impacted components” views, ticket/webhook automation.
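A matching sketch for the mapping step. How affected packages are expressed varies by collection, so the affected_packages() accessor and its field names (affectedPackage, packageName, packageVersion) are assumptions to adapt:

# Sketch: flag SBOM components that appear in mirrored vulnerability records.
# The affectedPackage/packageName/packageVersion field names are assumptions.
def affected_packages(rec):
    for entry in rec.get("affectedPackage") or []:
        yield entry.get("packageName"), entry.get("packageVersion")

def impacted_components(sbom_components, records):
    hits = []
    for rec in records:
        affected = set(affected_packages(rec))
        for comp in sbom_components:  # e.g. {"name": "openssl", "version": "1.1.1w-0+deb11u1"}
            if (comp["name"], comp["version"]) in affected:
                hits.append({"component": comp, "record_id": rec["id"]})
    return hits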
3) Risk & prioritization models¶
- Goal: move beyond raw CVSS—rank by exploitation, patch latency, and vendor cadence.
- How: compute rolling metrics in your warehouse (e.g., days from advisory→patch; exploit linkage counts); a sketch follows below.
- Ship: backlog ordering, SLA burn-down, board-friendly KPIs.
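The rolling-metric computation referenced above, sketched with pandas; the published_at/patched_at/vendor columns are illustrative projections you would derive from the mirrored JSON:

# Sketch: rolling advisory->patch latency per vendor from a flattened projection.
# Column names are illustrative; derive them from your own flattened view of the mirror.
import pandas as pd

df = pd.read_parquet("vulners_flat.parquet")  # flattened projection of the mirror
df["patch_latency_days"] = (
    pd.to_datetime(df["patched_at"]) - pd.to_datetime(df["published_at"])
).dt.days

# Rolling median over the last 90 records per vendor (swap in a time-based window if preferred)
latency = (
    df.sort_values("published_at")
      .groupby("vendor")["patch_latency_days"]
      .rolling(window=90, min_periods=10)
      .median()
)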
4) ML/RAG & analytics¶
- Goal: power LLM features or analytics notebooks with consistent, graph-linked facts.
- How: feed normalized records into a vector index (text fields + references) and keep it fresh via incrementals.
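A refresh sketch for the vector-index case; embed() and vector_store are placeholders for whatever embedding model and vector database you use, and the title/description text fields are assumptions that vary by collection:

# Sketch: keep a vector index fresh from incrementals, keyed by record id.
# embed() and vector_store are placeholders; text field names are assumptions.
def to_document(rec):
    ts = (rec.get("timestamps") or {}).get("updated")
    text = " ".join(str(rec.get(f, "")) for f in ("title", "description"))
    return {"id": rec["id"], "text": text, "updated": ts}

def refresh_vector_index(delta, embed, vector_store):
    docs = [to_document(rec) for rec in delta]
    vectors = embed([d["text"] for d in docs])      # batch-embed the concatenated text
    vector_store.upsert(                            # idempotent: keyed by the stable record id
        ids=[d["id"] for d in docs],
        vectors=vectors,
        metadata=[{"updated": d["updated"]} for d in docs],
    )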
Production checklist¶
- Baseline snapshot via fetch_collection(type=...). Store raw JSON + a flattened view for query.
- Checkpointing: persist the newest timestamps.updated processed. The first record of each incremental is a safe checkpoint.
- Incrementals every 12–24h with a small overlap (<25h). Upsert by id.
- Recovery: if a run slips >25h, re-baseline from full and resume.
- Schema hygiene: keep id, timestamps.updated, and primary linkage fields indexed.
- Observability: log counts, lag vs. checkpoint, and upsert failures.
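A minimal observability sketch for the last item; it emits one structured log line per run so you can alert when lag approaches the 25h window:

# Sketch: one structured metrics line per sync run (counts, lag vs. checkpoint, failures).
import json
import logging
from datetime import datetime, timezone

log = logging.getLogger("vulners-mirror")

def report_run(collection, fetched, upsert_failures, checkpoint_iso):
    checkpoint = datetime.fromisoformat(checkpoint_iso.replace("Z", "+00:00"))
    lag_hours = (datetime.now(timezone.utc) - checkpoint).total_seconds() / 3600
    log.info(json.dumps({
        "collection": collection,
        "fetched": fetched,
        "upsert_failures": upsert_failures,
        "checkpoint": checkpoint_iso,
        "lag_hours": round(lag_hours, 2),  # alert well before this reaches 25
    }))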
Practical persistence details¶
- Store latest_update in a durable location (file/DB); a sketch follows below.
- On each run, upsert by id, then advance the checkpoint to the newest timestamps.updated in the response.
- If your scheduler slips beyond 25h, take a fresh full collection and continue with incrementals.
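The checkpoint persistence mentioned above, sketched with a file for brevity (a DB row works the same way); load it to feed the after= parameter of fetch_collection_update(), and save it only after your upserts commit so a failed run retries from the old checkpoint:

# Sketch: persist the checkpoint between runs (file-based; path is illustrative).
from pathlib import Path

CHECKPOINT_FILE = Path("checkpoints/debiancve.txt")

def load_checkpoint():
    return CHECKPOINT_FILE.read_text().strip() if CHECKPOINT_FILE.exists() else None

def save_checkpoint(latest_update: str) -> None:
    CHECKPOINT_FILE.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT_FILE.write_text(latest_update)  # call only after a successful upsert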
Example output shape (for sanity checks)¶
A healthy run prints a compact preview (IDs + flat timestamps.updated)—use it to validate ordering and checkpointing during bring-up.
[
{'id': 'DEBIANCVE:CVE-2024-53164', 'timestamps.updated': '...'},
{'id': 'DEBIANCVE:CVE-2025-38079', 'timestamps.updated': '...'},
...,
{'id': 'DEBIANCVE:CVE-2021-34183', 'timestamps.updated': '...'}
]
FAQ¶
Is the data reproducible across runs?
Yes. Full collections are precompiled and refreshed on a cadence, and incrementals are keyed to update timestamps. You can re-run jobs, audit changes, and align downstream models to a clear source of truth.
How do I avoid gaps?
Schedule incrementals within the 25-hour window and keep a modest overlap. On failure or drift, take a fresh full snapshot and continue.
How do I scale storage/compute?
Keep raw JSON blobs for fidelity; project a columnar subset (Parquet/Delta) for analytics. Index id and timestamps.updated for fast upserts and recency queries.
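A projection sketch for that split, keeping the raw JSON alongside a small columnar subset; only id and timestamps.updated are assumed, and writing Parquet via pandas needs pyarrow (or fastparquet) installed:

# Sketch: raw JSON for fidelity + a columnar projection for analytics.
import json
import pandas as pd

def project_to_parquet(records, out_path="vulners_flat.parquet"):
    rows = [
        {
            "id": rec.get("id"),
            "updated": (rec.get("timestamps") or {}).get("updated"),
            "raw": json.dumps(rec),  # full record kept next to the projection
        }
        for rec in records
    ]
    pd.DataFrame(rows).to_parquet(out_path, index=False)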
TL;DR¶
- Plug in once (full + incrementals), mirror locally, upsert by id, checkpoint by timestamps.updated.
- From there, ship enrichment, risk models, analytics, and ML, without building a vulnerability corpus or ingestion logic yourself.