Frequently Asked Questions
Common questions about our data, methodology, and project goals.
About the Data
Our primary data source is the IRS Exempt Organizations Business Master File. We identify churches using IRS-specific classification codes:
- foundation = '10' (IRS Church foundation code)
- filing_requirement = '06' (Church filing exemption)
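As a minimal sketch, the two codes above could be applied as a row filter like the one below. The column names follow the field names quoted above. The sample rows are made up, and the `require_both` switch is an assumption, since the text does not say whether the two codes are combined with AND or OR.

```python
import csv
import io

# Codes quoted in the text above.
CHURCH_FOUNDATION_CODE = "10"
CHURCH_FILING_REQUIREMENT_CODE = "06"


def looks_like_church(row, require_both=False):
    """Return True if a record carries the church classification codes.

    `require_both` is an assumption: the source does not state whether
    the two codes are combined with AND or OR.
    """
    has_foundation = row.get("foundation", "").strip() == CHURCH_FOUNDATION_CODE
    has_filing = (
        row.get("filing_requirement", "").strip() == CHURCH_FILING_REQUIREMENT_CODE
    )
    if require_both:
        return has_foundation and has_filing
    return has_foundation or has_filing


# Tiny made-up sample standing in for the real IRS file.
SAMPLE_CSV = """name,foundation,filing_requirement
EXAMPLE COMMUNITY CHURCH,10,06
EXAMPLE RETREAT CENTER,15,01
EXAMPLE CHAPEL,10,01
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
church_names = [r["name"] for r in rows if looks_like_church(r)]
strict_church_names = [
    r["name"] for r in rows if looks_like_church(r, require_both=True)
]
```

With the looser OR rule, more borderline records pass the filter, which is one way false positives like those described below can enter the base dataset.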
However, the IRS data has inherent messiness:
- Organizations self-report their classification when filing
- The IRS does not actively verify or audit these classifications
- Some organizations have codes that were assigned decades ago and never updated
- Religious-adjacent organizations (retreat centers, religious publishers) sometimes share similar codes
Our website-matching algorithm helps filter these out by verifying the organization actually operates as a congregation, but some false positives remain in the base dataset.
We use a multi-phase geographic approach to process churches, starting with familiar regions before expanding nationwide. This strategy is driven by two factors:
- API Cost Management: Google Custom Search API costs approximately $5 per 1,000 queries. With ~280,000 churches in the dataset, full processing would cost ~$1,400 at roughly one query per church. Instead, we process a batch of churches each day within the API's free tier.
- Algorithm Refinement: By starting with a region the author is familiar with (Southern California), we can validate results and tune our LLM matching prompts before tackling the full dataset.
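The cost figure above can be reproduced with a few lines of arithmetic. This is only a sketch: the price and church count come from the text, while one query per church and a free quota of 100 queries per day are assumptions to check against current Google pricing.

```python
import math

PRICE_PER_1000_QUERIES_USD = 5.00  # from the text above
CHURCH_COUNT = 280_000             # from the text above
QUERIES_PER_CHURCH = 1             # assumption: one search per church
FREE_QUERIES_PER_DAY = 100         # assumption: verify the current free quota


def paid_cost_usd(churches, queries_per_church=QUERIES_PER_CHURCH):
    """Cost of running every query on the paid tier."""
    total_queries = churches * queries_per_church
    return total_queries / 1000 * PRICE_PER_1000_QUERIES_USD


def free_tier_days(
    churches,
    queries_per_church=QUERIES_PER_CHURCH,
    free_per_day=FREE_QUERIES_PER_DAY,
):
    """Days needed if only the daily free quota is used."""
    total_queries = churches * queries_per_church
    return math.ceil(total_queries / free_per_day)


print(f"Paid tier: ${paid_cost_usd(CHURCH_COUNT):,.0f}")
print(f"Free tier: {free_tier_days(CHURCH_COUNT):,} days")
```

Changing `QUERIES_PER_CHURCH` shows how quickly the trade-off between money and time shifts if a church needs more than one search.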
Processing phases are organized by ZIP code ranges, covering regions in this order:
- Southern California
- West Coast metros (Bay Area, Portland, Seattle)
- South
- Midwest
- Northeast
- Mountain West
- Remote areas and territories
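One way to order records by phase is to look up the first three digits of each ZIP code in a table of ranges. The sketch below is illustrative only: the prefix ranges are rough placeholders for the first two phases, not the project's actual boundaries, and the `zip` field name is hypothetical.

```python
# Phase order follows the list above. The ZIP prefix ranges are
# illustrative placeholders, NOT the project's actual boundaries.
PHASES = [
    ("Southern California", [(900, 935)]),
    ("West Coast metros", [(940, 951), (970, 972), (980, 981)]),
    # Later phases would be added here in the same form.
]


def phase_index(zip_code):
    """Return the position of the phase covering this ZIP code.

    Unmatched or malformed ZIP codes get len(PHASES), so they sort last.
    """
    digits = zip_code.strip()[:3]
    if len(digits) < 3 or not digits.isdigit():
        return len(PHASES)
    prefix = int(digits)
    for index, (_name, ranges) in enumerate(PHASES):
        if any(low <= prefix <= high for low, high in ranges):
            return index
    return len(PHASES)


def order_for_processing(records):
    """Sort records by phase; sorted() is stable, so ties keep input order."""
    return sorted(records, key=lambda record: phase_index(record["zip"]))


# Made-up records for demonstration.
sample = [
    {"name": "A", "zip": "10001"},
    {"name": "B", "zip": "94110"},
    {"name": "C", "zip": "90012-1234"},
]
ordered_names = [record["name"] for record in order_for_processing(sample)]
```

Keeping the ranges in a single table means a phase can be reordered or split without touching the lookup logic.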
Many churches genuinely do not have websites. Research shows that approximately 70% of American churches have fewer than 100 attendees, and smaller congregations are far less likely to maintain a web presence.
Our current website match rate of approximately 25% aligns with expectations given this distribution. Common reasons churches lack discoverable websites include:
- Resource constraints (website hosting requires ongoing time and money)
- Many small churches use Facebook as their only digital presence
- Older congregations may prioritize in-person community over digital presence
- Some communities have cultural factors that de-emphasize online presence
Research Findings
Processing church data has revealed several interesting patterns about American religious organizations and their digital presence:
The Digital Divide
There is a significant disparity in web presence across different church types:
- High web presence: Mainline Protestant churches, Catholic parishes (via diocesan support), megachurches
- Low web presence: Small congregations, newer churches, immigrant communities
Facebook as the Primary Platform
For many smaller congregations, Facebook is their entire digital presence. They don't maintain traditional websites and instead use Facebook for announcements, service times, and community engagement. This is particularly common in Spanish-language churches and congregations without dedicated administrative staff.
Proprietary Platform Lock-in
Many churches use platforms like Planning Center (Church Center) for their member-facing web presence. While these provide polished experiences for congregants, the data about church services, ministries, and programs is locked within these proprietary systems and not discoverable through web search or scraping.
IRS Data Quirks
The IRS data contains interesting artifacts:
- City names are often truncated ("SN BERNRDNO" for San Bernardino) due to field length limits
- Organization names may be decades old, not reflecting current church names
- Many church names appear in variant forms ("Saint" vs "St.", "First Baptist" vs "FBC")
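Quirks like these usually call for a normalization step before names are compared. The following is a minimal sketch, not the project's actual matching code: the lookup tables are hypothetical and contain only the examples mentioned above.

```python
import re

# Hypothetical lookup tables built only from the examples above.
# Caveat: "st" can also mean "street", so a real table needs more context.
NAME_ABBREVIATIONS = {
    "st": "saint",
    "fbc": "first baptist church",
}
CITY_ALIASES = {
    "SN BERNRDNO": "SAN BERNARDINO",
}


def normalize_name(name):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    without_apostrophes = re.sub(r"['\u2019]", "", name.lower())
    tokens = re.findall(r"[a-z0-9]+", without_apostrophes)
    expanded = [NAME_ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(expanded)


def normalize_city(city):
    """Map a truncated IRS city value to its full name when known."""
    key = " ".join(city.upper().split())
    return CITY_ALIASES.get(key, key)


same_saint = normalize_name("St. Mark's Church") == normalize_name(
    "SAINT MARKS CHURCH"
)
same_fbc = normalize_name("FBC Riverside") == normalize_name(
    "First Baptist Church Riverside"
)
```

Normalization of this kind cannot recover a name that has changed entirely since registration, so it only helps with spelling variants.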
Project Goals
This project aims to create an enriched dataset of American churches capable of answering aggregate questions like: "Which churches offer this service or program?"
The original motivation came from a personal struggle: trying to find churches with co-ed groups for single 30-somethings. In a rapidly self-isolating America, finding authentic community through religious organizations shouldn't require visiting dozens of church websites individually.
Currently, if you want to know which churches in your area have:
- Young adult ministries
- Recovery programs
- Food banks or community meals
- Contemporary vs. traditional worship
- Small groups or life groups
...you have to manually check each church's website. This project extracts structured data from church websites so these questions can be answered at scale, making it easier for people seeking community to find churches that match their needs.
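To illustrate what "answered at scale" could look like, here is a small sketch of a structured record and a query over it. The `ChurchRecord` fields, the program labels, and the sample data are all hypothetical. The project's real schema is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class ChurchRecord:
    """Hypothetical shape for one enriched church entry."""

    name: str
    zip_code: str
    programs: set[str] = field(default_factory=set)


def churches_offering(records, program, zip_prefix=""):
    """Return records that list `program`, optionally limited by ZIP prefix."""
    return [
        record
        for record in records
        if program in record.programs and record.zip_code.startswith(zip_prefix)
    ]


# Made-up sample data for demonstration.
records = [
    ChurchRecord("Example Community Church", "90012", {"young_adults", "food_bank"}),
    ChurchRecord("Example Chapel", "90210", {"recovery"}),
    ChurchRecord("Example Fellowship", "10001", {"young_adults"}),
]

local_young_adult = [
    record.name for record in churches_offering(records, "young_adults", "90")
]
```

A single filter like this replaces the manual step of opening each church's website in turn.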
Several church directories exist, including ChurchFinder.com, FindAChurch.com, and denomination-specific directories like The Gospel Coalition's church search. These services have fundamental limitations:
- Self-submitted data: Churches must opt in and manually enter their information. This creates selection bias: only churches that are aware of these directories and motivated to join them appear.
- No structured ministry data: Most directories only capture basic info (name, address, denomination). You can't search for "churches with recovery programs" or "churches with young adult groups."
- Denominational silos: Many directories only list churches from specific denominations or theological traditions.
- Stale information: Without active maintenance, directory listings become outdated as churches move, close, or change leadership.
This project takes a different approach: start with the authoritative IRS registry of all tax-exempt religious organizations, then enrich it with structured data extracted directly from church websites using AI. The result is a more comprehensive and queryable dataset than any self-submitted directory can provide.
This project began in early 2026. It is an ongoing effort: processing the full dataset of ~280,000 churches takes time because of API rate limits and the need to validate the matching algorithms carefully. Check the Progress page to see current processing status.
Yes, the enriched church dataset will be released publicly under a non-commercial license. The data will be free to use for research, personal projects, and non-commercial applications.
The code used to generate this dataset is private and will not be open-sourced.