How I built the DFG database from scratch

GEPRIS is the German Research Foundation's public grant database. It contains every DFG-funded project since the 1990s — researcher names, institutions, project titles, funding amounts (indirectly), and status. It is the most complete public record of German academic funding in existence.

It has no public API.

So I scraped it.

Why this matters

Before I built this, finding DFG-funded researchers in a specific field required one of the following:

Manually searching GEPRIS by keyword (partial results only)
Paying for institutional data access (not an option for most researchers)
Guessing

The data was publicly available but practically inaccessible. Making it searchable and structured changes what you can do with it.

The architecture

Phase 1: Researcher ID discovery

GEPRIS assigns every researcher a numeric ID. The IDs are sequential and start from low numbers. The scraper iterated through IDs, fetching each researcher's summary page, and extracted: name, institution, city, country, and a list of associated project IDs.

The total ID space is around 150,000 — most are empty or historical. About 35,000 yield active researcher pages.
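As a rough sketch of that sweep (not the production scraper), the loop below walks a range of IDs, skips the empty ones, and pauses between requests. The names `sweep_ids`, `fetch`, and `delay_seconds` are hypothetical, and `fetch` is injected so the example runs without touching the network.

```python
import time
from typing import Callable, Iterator, Optional


def sweep_ids(
    start: int,
    stop: int,
    fetch: Callable[[int], Optional[str]],
    delay_seconds: float = 1.0,
) -> Iterator[tuple[int, str]]:
    """Yield (researcher_id, html) for every ID in [start, stop) that has a page.

    `fetch` is a hypothetical callable that returns the page HTML, or None
    when the ID is empty (for example, on a 404).
    """
    for researcher_id in range(start, stop):
        html = fetch(researcher_id)
        if html is not None:
            yield researcher_id, html
        # Pause between requests so the sweep stays under the rate limit.
        time.sleep(delay_seconds)


# Stand-in for a real HTTP fetch: only IDs 2 and 5 "exist" here.
fake_pages = {2: "<h1>Researcher A</h1>", 5: "<h1>Researcher B</h1>"}
found = list(sweep_ids(1, 7, fake_pages.get, delay_seconds=0))
```

In a real run, `fetch` would presumably wrap a `requests.get` call and the returned HTML would go to BeautifulSoup for field extraction.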

Phase 2: Project data

For each researcher, the scraper fetched each project page and extracted: title, programme type, subject area, status, and a flag for running vs. completed.

The subject area field is the interesting one — GEPRIS uses German-language categories in a nested hierarchy. Each project has a primary review board area and a sub-area. Translating and normalising these into consistent English labels took a separate preprocessing step.
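One way such a normalisation step could look is a lookup table keyed on a cleaned-up version of the German label. This is only a sketch: the three entries in `AREA_LABELS` are illustrative placeholders, not the real GEPRIS taxonomy or the actual translation table, and the function names are made up for this example.

```python
# Illustrative mapping only: these entries are placeholders, not the real
# GEPRIS taxonomy or the actual translation table used for the site.
AREA_LABELS = {
    "medizin": "Medicine",
    "biologie": "Biology",
    "zellbiologie": "Cell Biology",
}


def normalise_area(raw: str) -> str:
    """Map a raw German area string to a consistent English label."""
    cleaned = " ".join(raw.split())  # collapse runs of whitespace
    # Fall back to the cleaned original so unmapped areas stay visible.
    return AREA_LABELS.get(cleaned.casefold(), cleaned)


def normalise_pair(primary: str, sub: str) -> tuple[str, str]:
    """Normalise a (review board area, sub-area) pair."""
    return normalise_area(primary), normalise_area(sub)
```

Returning the unmapped string instead of raising makes gaps in the table easy to spot in the output.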

Phase 3: Grant counting

For each researcher, I counted only projects with status=running. This gives the "active grant count" — a proxy for current hiring capacity.
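The counting rule is simple enough to show in a few lines. The sketch below assumes one row per researcher-project pair, with hypothetical keys `researcher_id` and `status`; the real schema may differ.

```python
from collections import Counter


def active_grant_counts(projects):
    """Count running projects per researcher.

    `projects` is an iterable of dicts with hypothetical keys
    "researcher_id" and "status".
    """
    return Counter(
        p["researcher_id"] for p in projects if p["status"] == "running"
    )


sample = [
    {"researcher_id": 101, "status": "running"},
    {"researcher_id": 101, "status": "completed"},
    {"researcher_id": 101, "status": "running"},
    {"researcher_id": 202, "status": "completed"},
]
counts = active_grant_counts(sample)
```

A researcher with no running projects never enters the counter, so looking them up returns 0.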

The infrastructure

I ran everything on a cloud server (VPS, 8 cores, 16GB RAM) rather than my laptop. Reasons:

Duration: The full scrape takes 28–32 hours. Not realistic on a laptop that needs to sleep and move.
IP rate limiting: GEPRIS rate-limits requests. Run from a server with configurable delays between requests, this is manageable. On a residential IP, it's less predictable.
Resumability: The scraper writes to a database after each batch. If it crashes (it did, twice), it resumes from where it stopped — no data loss.
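The resume-after-crash behaviour can be sketched as follows. This is a minimal illustration, not the actual scraper: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and the table layout, `run_resumable`, and `scrape_one` are all hypothetical.

```python
import sqlite3


def run_resumable(conn, ids, scrape_one, batch_size=100):
    """Process `ids` in batches, committing after each batch.

    IDs already stored are skipped, so a restart picks up after the last
    committed batch. `scrape_one` is a hypothetical callable that returns
    a name for an ID. Returns the number of IDs processed in this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS researchers ("
        "id INTEGER PRIMARY KEY, name TEXT)"
    )
    done = {row[0] for row in conn.execute("SELECT id FROM researchers")}
    pending = [i for i in ids if i not in done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        rows = [(i, scrape_one(i)) for i in batch]
        # One transaction per batch: a crash loses at most the current batch.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO researchers (id, name) VALUES (?, ?)",
                rows,
            )
    return len(pending)


conn = sqlite3.connect(":memory:")
first = run_resumable(conn, range(1, 5), lambda i: f"name-{i}", batch_size=2)
# A second run over a larger range only touches the IDs not yet stored.
second = run_resumable(conn, range(1, 7), lambda i: f"name-{i}", batch_size=2)
```

Keying the table on the researcher ID also means a re-scraped ID overwrites its row instead of creating a duplicate.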

The scraper is written in Python using requests and BeautifulSoup4. The database is PostgreSQL, hosted on Supabase. The front end is this Next.js site.

The quirks

Multi-value area fields: Some projects list two or three sub-areas separated by newlines. The first version of the enrichment script treated the whole string as one area — wrong. Fixed by splitting on newline and counting each area individually.
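The fix amounts to something like the sketch below. The function names and the sample area strings are placeholders for illustration; the real enrichment script may be structured differently.

```python
from collections import Counter


def split_areas(raw: str) -> list[str]:
    """Split a newline-separated area field into individual areas."""
    return [part.strip() for part in raw.splitlines() if part.strip()]


def count_areas(area_fields):
    """Tally every individual area across all projects' area fields."""
    counts = Counter()
    for raw in area_fields:
        counts.update(split_areas(raw))
    return counts


# Placeholder area names; stray blank lines and padding are ignored.
fields = ["Area A\nArea B", "Area B", "Area A\n\n Area C \n"]
area_counts = count_areas(fields)
```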

Name deduplication: Some researchers appear under slight name variations (with and without titles, different hyphenation). I used GEPRIS researcher IDs as the primary key to avoid double-counting.

Title noise in names: GEPRIS stores researcher names with academic titles included — "Prof. Dr. Ivan Dikic". For external search links (Google Scholar, PubMed), the titles need to be stripped. I wrote a cleanName() utility that handles German academic title conventions in the right order (longest compound forms first).
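To show why the ordering matters, here is a Python sketch of the same idea. It is not the site's `cleanName()` itself, and the title list is a hypothetical subset. Trying the longest forms first means "Dr.-Ing." is removed as a unit, where a bare "Dr." match would leave "-Ing." stuck to the name.

```python
import re

# Hypothetical subset of German academic titles, for illustration only.
TITLES = [
    "Prof.",
    "Dr.",
    "Prof. Dr.",
    "Dr. med.",
    "Dr.-Ing.",
    "Dr. rer. nat.",
    "Prof. Dr. med.",
]

# Sort longest first so compound forms win over their shorter prefixes.
_ALTERNATIVES = "|".join(
    re.escape(title) for title in sorted(TITLES, key=len, reverse=True)
)
# Match one or more leading titles, each followed by optional whitespace.
_TITLE_RE = re.compile(r"^(?:(?:" + _ALTERNATIVES + r")\s*)+")


def clean_name(raw: str) -> str:
    """Strip leading academic titles from a researcher name."""
    return _TITLE_RE.sub("", raw.strip()).strip()
```

With this list, `clean_name("Prof. Dr. Ivan Dikic")` returns `"Ivan Dikic"`, and a name without titles passes through unchanged.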

Status lag: GEPRIS sometimes shows completed projects as "running" for weeks after they end. The data has a lag of roughly 2–4 weeks vs. real grant status.

What the data can tell you (and what it can't)

Can: Who holds many concurrent grants. Which institutions have the highest total grant load. Which sub-fields attract the most DFG funding. Grant count as a proxy for hiring capacity.

Can't: Exact funding amounts (GEPRIS doesn't publish this). Whether a grant funds one person or ten. Whether a grant is in its first year or final months.

The DFG Finder makes this data searchable. The methodology is transparent: April 2026 snapshot, running projects only, public GEPRIS data.

Data was last updated April 2026. If you want to discuss using this for your lab or institution, get in touch.

