What you're building
A prospecting pipeline backed by Cloudflare D1 that discovers businesses from two public data sources — county parcel records and Google Places — deduplicates them, and enriches each lead before an AI coding agent scores, classifies, and loads them into D1.
The two discovery paths have different enrichment needs, so each gets its own walkthrough.
Places-first path
Google Places is the entry point. Each result already includes coordinates, website, rating, and review count — so geocode and business intel enrichment are skipped. The remaining stages fill in email and ownership data.
Parcel-first path
Parcel records are the entry point. For each unlinked parcel, a single Places Text Search call finds the operating business and returns coordinates, rating, reviews, and status in the same response — so no separate geocode or business intel passes are needed. The remaining enrichment stages fill in email and confirm ownership.
Everything lands in a Cloudflare D1 database. The discovery scripts output SQL files, the enrichment scripts query D1 directly via wrangler d1 execute, and the AI agent reads and writes through the same interface. One database, no moving parts.
Why two sources? Places-first gives you broad coverage — every dentist, restaurant, or church in a city. Parcel-first gives you owner-operators — the person who owns the building is often the decision-maker. The two pipelines find different leads with different conversion profiles. Run both.
Prerequisites
- A Cloudflare account (free tier works)
- The wrangler CLI installed and authenticated (npm install -g wrangler && wrangler login)
- A Google Places API key (the new API, not the legacy one)
- Python 3 (stdlib only for core pipeline)
- County parcel data as shapefiles or file geodatabases — most US counties publish these for free
- An AI coding agent (Claude Code, Cursor, etc.)
```bash
# ~/.zshrc or ~/.bashrc
export GOOGLE_API_KEY="your-places-api-key"
```
Set up D1
Create the database and apply a migration with two tables: leads for the prospects the pipeline discovers, and parcels for the county property records you'll match against.
```bash
npx wrangler d1 create prospecting-db
# Copy the database_id from the output
```
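Migrations are applied through wrangler, which reads the database binding from wrangler.toml. A minimal sketch (the binding name is illustrative; the database_id placeholder stands in for the value you copied):

```toml
# wrangler.toml -- minimal config for D1 migrations (illustrative)
name = "prospecting"

[[d1_databases]]
binding = "DB"
database_name = "prospecting-db"
database_id = "paste-the-id-from-the-create-output"
migrations_dir = "migrations"
```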
The schema
```sql
-- 0001_initial.sql
CREATE TABLE leads (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  company TEXT,
  phone TEXT,
  email TEXT,
  website TEXT,
  address TEXT,
  city TEXT,
  state TEXT,
  zip TEXT,
  lat REAL,
  lng REAL,
  lead_type TEXT DEFAULT 'tenant',
  score TEXT,
  parcel_id TEXT,
  rating REAL,
  review_count INTEGER,
  business_status TEXT,
  place_id TEXT,
  geocode_enriched INTEGER NOT NULL DEFAULT 0,
  email_enriched INTEGER NOT NULL DEFAULT 0,
  hunter_enriched INTEGER NOT NULL DEFAULT 0,
  parcel_enriched INTEGER NOT NULL DEFAULT 0,
  intel_enriched INTEGER NOT NULL DEFAULT 0,
  created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE parcels (
  prop_id TEXT PRIMARY KEY,
  owner_name TEXT,
  situs_addr TEXT,
  situs_city TEXT,
  situs_zip TEXT,
  land_use TEXT,
  acres REAL,
  market_value REAL
);

CREATE INDEX idx_leads_city ON leads(city);
CREATE INDEX idx_leads_enrichment ON leads(geocode_enriched, email_enriched, hunter_enriched, parcel_enriched, intel_enriched);
CREATE INDEX idx_parcels_city ON parcels(situs_city);
CREATE INDEX idx_parcels_addr ON parcels(situs_addr);
```
The five _enriched flags on leads are the backbone of the pipeline. More on those in the enrichment section.
```bash
npx wrangler d1 migrations apply prospecting-db --remote
```
Load parcel data into D1
This is a one-time step. US counties publish property records as GIS datasets — owner name, address, land use code, acreage, market value. You download the shapefiles, extract the commercial parcels, and load them into D1's parcels table. After that, everything queries the database.
In Texas, the statewide parcel dataset lives at data.geographic.texas.gov. Most states have an equivalent. Download shapefiles for the counties you care about.
Extract and filter
Parcel shapefiles contain residential, commercial, agricultural, and exempt properties. You only want the commercial ones. Land use codes vary by county, but most follow a pattern:
| Code | Category |
|---|---|
| F1, F3 | Commercial / office |
| B1, B2, B3 | Multifamily |
| EX9 | Church / nonprofit |
| EX5 | School |
Write a script that reads the .dbf file (the attribute table in a shapefile), filters by land use code, and writes the records you care about:
```python
import struct

def read_dbf(path):
    """Read a .dbf file into a list of dicts. stdlib only."""
    with open(path, 'rb') as f:
        header = f.read(32)
        num_records = struct.unpack('<I', header[4:8])[0]
        header_size = struct.unpack('<H', header[8:10])[0]
        record_size = struct.unpack('<H', header[10:12])[0]
        # Field descriptors are 32 bytes each, terminated by a 0x0D byte
        fields = []
        while True:
            field_desc = f.read(32)
            if not field_desc or field_desc[0] == 0x0D:
                break
            name = field_desc[:11].split(b'\x00')[0].decode()
            ftype = chr(field_desc[11])
            fsize = field_desc[16]
            fields.append((name, ftype, fsize))
        f.seek(header_size)
        rows = []
        for _ in range(num_records):
            record = f.read(record_size)
            if record[0:1] == b'*':  # deletion flag: skip deleted records
                continue
            pos = 1  # first byte of each record is the deletion flag
            row = {}
            for name, ftype, fsize in fields:
                val = record[pos:pos + fsize].decode('latin-1').strip()
                row[name] = val
                pos += fsize
            rows.append(row)
    return rows

COMMERCIAL_CODES = {'F1', 'F3', 'F10', 'B1', 'B2', 'B3', 'EX9', 'EX5'}

parcels = read_dbf('collin_county/parcels.dbf')
commercial = [p for p in parcels if p.get('land_use') in COMMERCIAL_CODES]
```
Filter by your ideal customer profile
Not every commercial parcel is worth pursuing. Before you spend API calls on enrichment, filter for the businesses you actually want to reach. Define your ICP and drop everything else.
This depends on what you sell. A landscaping company might want parcels over two acres. A SaaS company might only want medical offices. A cleaning service might want office buildings above a certain market value. Whatever the criteria, apply them early — enrichment costs money per lead.
```python
def matches_icp(parcel):
    """Filter parcels by your ideal customer profile. Customize this."""
    land_use = parcel.get('land_use', '')
    if land_use not in COMMERCIAL_CODES:
        return False
    # Example: skip parcels without a street address (can't geocode)
    if not parcel.get('situs_addr'):
        return False
    # Example: only properties above a certain acreage
    try:
        acres = float(parcel.get('acres', 0))
        if acres < 0.25:
            return False
    except ValueError:
        pass
    return True

filtered = [p for p in commercial if matches_icp(p)]
```
After filtering, a single county typically yields thousands of parcels worth investigating. Four counties in the Dallas-Fort Worth area produced ~86,000 commercial parcels total — with ICP filters applied, that narrowed to around 12,000.
Load into D1
Generate SQL INSERTs from the filtered parcels and apply them:
```python
def sql_quote(s):
    """Escape single quotes so owner names like O'BRIEN don't break the SQL."""
    return "'" + str(s).replace("'", "''") + "'"

with open('/tmp/load_parcels.sql', 'w') as f:
    for p in filtered:
        f.write(
            f"INSERT OR IGNORE INTO parcels (prop_id, owner_name, situs_addr, situs_city, "
            f"situs_zip, land_use, acres, market_value) VALUES ("
            f"{sql_quote(p['prop_id'])}, {sql_quote(p['owner_name'])}, {sql_quote(p['situs_addr'])}, "
            f"{sql_quote(p['situs_city'])}, {sql_quote(p.get('situs_zip', ''))}, {sql_quote(p['land_use'])}, "
            f"{p.get('acres') or 0}, {p.get('market_value') or 0});\n"
        )
```
```bash
npx wrangler d1 execute prospecting-db --remote --file /tmp/load_parcels.sql
```
Once the parcels are in D1, you never touch the shapefiles again. Discovery scripts query the parcels table directly.
The two discovery pipelines
Pipeline A: Places-first
Search Google Places by business type and city. This finds every business matching your categories, whether they own or rent their space. Good for broad coverage.
```python
import json
import urllib.request

PLACES_BASE = "https://places.googleapis.com/v1/places"

def places_search(query, api_key, max_results=20):
    payload = json.dumps({
        "textQuery": query,
        "maxResultCount": max_results,
    }).encode()
    req = urllib.request.Request(
        f"{PLACES_BASE}:searchText",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-Goog-Api-Key": api_key,
            "X-Goog-FieldMask": ",".join([
                "places.id", "places.displayName",
                "places.nationalPhoneNumber", "places.formattedAddress",
                "places.location", "places.websiteUri",
                "places.rating", "places.userRatingCount",
                "places.businessStatus",
            ]),
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=15) as r:
        return json.loads(r.read()).get("places", [])
```
Run a batch of queries by category and city:
```python
import os
import time

QUERIES = [
    "dentist Allen TX",
    "dental office Allen TX",
    "restaurant Allen TX",
    "church Allen TX",
    "auto repair Allen TX",
]

all_results = []
for q in QUERIES:
    results = places_search(q, api_key=os.environ["GOOGLE_API_KEY"])
    all_results.extend(results)
    time.sleep(0.2)  # small pause between requests
```
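The snippets above stop at raw Places results. At insert time each result still has to be flattened onto the leads schema. A sketch of that mapping, with place_to_lead as a hypothetical helper; note that geocode_enriched and intel_enriched can be set to 1 in the INSERT because the Places response already carried that data (more in the enrichment section):

```python
def place_to_lead(place):
    """Flatten a Places API result onto the leads schema. Illustrative helper."""
    loc = place.get("location", {})
    return {
        "company": place.get("displayName", {}).get("text", ""),
        "phone": place.get("nationalPhoneNumber", ""),
        "website": place.get("websiteUri", ""),
        "address": place.get("formattedAddress", ""),
        "lat": loc.get("latitude"),
        "lng": loc.get("longitude"),
        "rating": place.get("rating"),
        "review_count": place.get("userRatingCount", 0),
        "business_status": place.get("businessStatus", "UNKNOWN"),
        "place_id": place.get("id"),
    }

leads = [place_to_lead(p) for p in all_results]
```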
Pipeline B: Parcel-first
Query unlinked parcels from D1 — ones not yet matched to any lead. For each parcel address, search Google Places to find the operating business. This finds owner-operators — the person on the parcel deed is often the decision-maker you want to reach.
```bash
# Find parcels not yet linked to any lead
npx wrangler d1 execute prospecting-db --remote --json \
  --command "SELECT * FROM parcels WHERE prop_id NOT IN (SELECT parcel_id FROM leads WHERE parcel_id IS NOT NULL)"
```
```python
def find_business_at_parcel(parcel, api_key):
    """Look up the Google Places business operating at a parcel address."""
    addr = f"{parcel['situs_addr']}, {parcel['situs_city']}, TX"
    results = places_search(addr, api_key, max_results=5)
    if not results:
        return None
    # Prefer a result whose address shares the parcel's street number
    for place in results:
        place_addr = place.get("formattedAddress", "").upper()
        parcel_street = parcel["situs_addr"].upper().split()[0]
        if parcel_street in place_addr:
            return place
    return results[0]
```
The parcel tells you who owns the property. The Places result tells you who operates there. When those match (the business name resembles the owner name), you've found an owner-operator — someone who both controls the property and runs the business.
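To drive this over every unlinked parcel, parse the JSON that wrangler emits and feed each row through the lookup. A sketch, assuming wrangler d1 execute --json prints an array of result sets each holding a results list, and reusing the hypothetical place_to_lead from the Places-first pipeline:

```python
import json
import os
import subprocess

# Pull unlinked parcels out of D1 (same query as above)
out = subprocess.run(
    ["npx", "wrangler", "d1", "execute", "prospecting-db", "--remote", "--json",
     "--command",
     "SELECT * FROM parcels WHERE prop_id NOT IN "
     "(SELECT parcel_id FROM leads WHERE parcel_id IS NOT NULL)"],
    capture_output=True, text=True, check=True,
)
unlinked = json.loads(out.stdout)[0]["results"]

new_leads = []
for parcel in unlinked:
    place = find_business_at_parcel(parcel, os.environ["GOOGLE_API_KEY"])
    if place:
        lead = place_to_lead(place)
        lead["parcel_id"] = parcel["prop_id"]  # link back to the parcel row
        new_leads.append(lead)
```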
Deduplication
Both pipelines produce overlapping results. Before inserting, export your existing leads from D1 for dedup:
```bash
npx wrangler d1 execute prospecting-db --remote --json \
  --command "SELECT company, phone, address, city FROM leads" > /tmp/existing_leads.json
```
Deduplicate by phone number first (most reliable), then by normalized company name and street address:
```python
import re

def normalize(s):
    """Strip punctuation, collapse whitespace, uppercase."""
    return re.sub(r'\s+', ' ', re.sub(r'[^A-Z0-9 ]', '', s.upper())).strip()

def normalize_phone(s):
    """Keep digits only so '(972) 555-0100' matches '972-555-0100'."""
    return re.sub(r'\D', '', s or '')

def dedup_leads(new_leads, existing_leads):
    """Filter out leads already in D1."""
    seen_phones = {normalize_phone(c['phone']) for c in existing_leads if c.get('phone')}
    seen_names = {normalize(c['company']) for c in existing_leads if c.get('company')}
    seen_addrs = {normalize(c['address']) for c in existing_leads if c.get('address')}
    unique = []
    for lead in new_leads:
        phone = normalize_phone(lead.get('phone', ''))
        name = normalize(lead.get('company', ''))
        addr = normalize(lead.get('address', ''))
        if phone and phone in seen_phones:
            continue
        if name and name in seen_names:
            continue
        if addr and addr in seen_addrs:
            continue
        seen_phones.add(phone)
        seen_names.add(name)
        seen_addrs.add(addr)
        unique.append(lead)
    return unique
```
Three-layer dedup matters. A business might have a different phone on Google than in your database (cell vs. landline). Or the address might be formatted differently ("123 Main St" vs. "123 Main Street"). Checking all three catches most duplicates without false positives.
The enrichment pipeline
After discovery, each lead has basic info — name, address, maybe a phone number. Enrichment fills in the rest: email, property ownership, and for leads that didn't come through a Places call, coordinates and business intelligence.
Both pipelines call the Places API during discovery and get coordinates, rating, reviews, and status back in that same response. Set the geocode_enriched and intel_enriched flags at insert time for those leads — no separate pass needed. The flags exist for leads imported from other sources that didn't go through Places.
The flag pattern
Each enrichment stage is controlled by a boolean flag on the lead record. The flag means "this stage has been attempted" — not "it succeeded." This prevents retrying dead ends (businesses with no website will never yield an email, and you don't want to re-scrape them every run).
| Flag | Stage | What it does |
|---|---|---|
| geocode_enriched | Geocode | Backfill lat/lng — set at insert for Places/Parcel discovery, run separately for other sources |
| email_enriched | Email Scrape | Scrape website homepage + /contact page for email addresses |
| hunter_enriched | Hunter.io | Domain-based email lookup for leads with a website but no scraped email |
| parcel_enriched | Parcel Link | Match address to parcel records for owner info |
| intel_enriched | Business Intel | Rating, reviews, status — set at insert for Places/Parcel discovery, run separately for other sources |
Every script follows the same pattern:
```bash
# 1. Query D1 for unenriched candidates
npx wrangler d1 execute prospecting-db --remote --json \
  --command "SELECT * FROM leads WHERE geocode_enriched = 0 AND address IS NOT NULL"
```
```python
# 2. Process each one, build SQL updates
updates = []
for lead in candidates:
    result = do_enrichment(lead)
    updates.append(build_update(lead, result))
    # 3. Set the flag regardless of outcome
    updates.append(f"UPDATE leads SET geocode_enriched = 1 WHERE id = {lead['id']};")

# 4. Write updates to a SQL file, apply to D1
with open('/tmp/geocode_updates.sql', 'w') as f:
    f.write('\n'.join(updates))
```
```bash
npx wrangler d1 execute prospecting-db --remote --file /tmp/geocode_updates.sql
```
The "set flag regardless" rule is what makes the pipeline idempotent. Run it as many times as you want — it only processes new leads.
Stage 1: Geocode
If a lead has an address but no coordinates, geocode it via Places Text Search. You need lat/lng for distance filtering and territory mapping later.
```python
def geocode(address, api_key):
    results = places_search(address, api_key, max_results=1)
    if results and "location" in results[0]:
        loc = results[0]["location"]
        return loc["latitude"], loc["longitude"]
    return None, None
```
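Plugged into the flag pattern above (where do_enrichment is this geocode call), the build_update placeholder might look like this sketch:

```python
def build_update(lead, result):
    """Turn a (lat, lng) geocode result into an UPDATE statement. Illustrative."""
    lat, lng = result
    if lat is None:
        return ""  # nothing to write; the flag update still runs
    return f"UPDATE leads SET lat = {lat}, lng = {lng} WHERE id = {lead['id']};"
```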
Stage 2: Email scraping
For each lead with a website, fetch the homepage and the /contact page, then regex out email addresses. Filter aggressively — junk domains, noreply addresses, and framework artifacts pollute the results.
```python
import re
import ssl
import urllib.request

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}')

JUNK_DOMAINS = {
    'example.com', 'sentry.io', 'wixpress.com', 'googleapis.com',
    'w3.org', 'schema.org', 'wordpress.org', 'squarespace.com',
    'gravatar.com', 'godaddy.com',
}
JUNK_PREFIXES = {'noreply', 'no-reply', 'mailer-daemon', 'postmaster', 'webmaster'}

# Relaxed SSL: small-business certs are frequently broken (see note below)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def scrape_emails(website):
    """Scrape email addresses from a website's homepage and /contact page."""
    emails = set()
    for path in ['', '/contact', '/contact-us']:
        url = website.rstrip('/') + path
        try:
            req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
            with urllib.request.urlopen(req, timeout=10, context=ctx) as r:
                html = r.read().decode(errors='replace')
            for match in EMAIL_RE.findall(html):
                addr = match.lower()
                domain = addr.split('@')[1]
                prefix = addr.split('@')[0]
                if domain not in JUNK_DOMAINS and prefix not in JUNK_PREFIXES:
                    emails.add(addr)
        except Exception:
            continue
    return list(emails)
```
Why the relaxed SSL context? Small-business websites frequently have expired, self-signed, or misconfigured certificates. You're scraping a public homepage for a mailto link, not transmitting credentials.

One caveat: verify before sending. Scraped emails are unverified — the address may exist in the HTML but bounce when you send to it. Run leads through an email verification service (ZeroBounce, NeverBounce, or Hunter's built-in verifier) before any outreach campaign. Sending to a cold, unverified list damages your sender reputation faster than any other mistake.
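Scraping usually turns up more than one address per site. One simple heuristic, purely illustrative, is to prefer an address on the business's own domain over generic webmail. This assumes the stored website URL includes a scheme, as Places returns it:

```python
from urllib.parse import urlparse

def pick_best_email(emails, website):
    """Prefer an address on the business's own domain. Illustrative heuristic."""
    site_domain = urlparse(website).netloc.lower().removeprefix('www.')
    for addr in sorted(emails):
        if site_domain and addr.endswith('@' + site_domain):
            return addr
    return emails[0] if emails else None
```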
Stage 3: Hunter.io email enrichment
For leads with a website but no email after scraping, Hunter.io can find a deliverable address by querying the domain. This catches professional services and companies that hide their email behind contact forms.
```python
import json
import os
import urllib.request

def hunter_find_email(domain, api_key):
    """Find the most likely email address for a domain via Hunter.io."""
    url = f"https://api.hunter.io/v2/domain-search?domain={domain}&api_key={api_key}&limit=1"
    try:
        with urllib.request.urlopen(url, timeout=10) as r:
            data = json.loads(r.read())
        emails = data.get("data", {}).get("emails", [])
        if emails:
            return emails[0].get("value")
    except Exception:
        pass
    return None
```
Query D1 for leads that have a website but no email and haven't been through Hunter yet:
```bash
npx wrangler d1 execute prospecting-db --remote --json \
  --command "SELECT id, company, website FROM leads WHERE hunter_enriched = 0 AND website IS NOT NULL AND (email IS NULL OR email = '')"
```
Hunter.io's free plan gives you 50 requests/month — enough for a small pipeline. It works best for companies with a domain-level email pattern (e.g. firstname@company.com). The flag follows the same idempotent pattern — set it regardless of whether an email was found.
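One glue detail: hunter_find_email wants a bare domain, while the leads table stores full website URLs. A stdlib-only sketch to bridge the two:

```python
from urllib.parse import urlparse

def website_to_domain(website):
    """Extract a bare domain ('acme.com') from a stored website URL."""
    if '//' not in website:
        website = 'https://' + website  # urlparse needs a scheme
    host = urlparse(website).netloc.lower()
    return host.removeprefix('www.').split(':')[0]
```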
Stage 4: Parcel linking
Match leads to parcel records in D1 by normalized street address. When a lead's address matches a parcel row, you get the owner's name, acreage, market value, and land use — context that feeds into scoring later.
```python
def normalize_address(addr):
    """Collapse to comparable form: uppercase, strip suite/unit, abbreviate."""
    addr = addr.upper().strip()
    # Match on whitespace, not \b: '#' has no word boundary after a space
    addr = re.sub(r'\s(SUITE|STE|UNIT|APT|#)\s*\S+', '', addr)
    addr = re.sub(r'\bSTREET\b', 'ST', addr)
    addr = re.sub(r'\bAVENUE\b', 'AVE', addr)
    addr = re.sub(r'\bBOULEVARD\b', 'BLVD', addr)
    addr = re.sub(r'\bDRIVE\b', 'DR', addr)
    return re.sub(r'\s+', ' ', addr).strip()
```
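The matching pass itself can then be a dictionary join: index parcels by normalized address, look each lead up, and write the link plus the flag back to D1. A sketch, reusing the sql_quote helper from the parcel loader and assuming parcels and candidates were fetched via wrangler as in earlier stages:

```python
# Index parcels by normalized street address
parcels_by_addr = {normalize_address(p['situs_addr']): p
                   for p in parcels if p.get('situs_addr')}

updates = []
for lead in candidates:
    # Leads store full formatted addresses; compare on the street portion only
    street = (lead.get('address') or '').split(',')[0]
    parcel = parcels_by_addr.get(normalize_address(street))
    if parcel:
        updates.append(
            f"UPDATE leads SET parcel_id = {sql_quote(parcel['prop_id'])} "
            f"WHERE id = {lead['id']};"
        )
    # Flag is set whether or not a match was found
    updates.append(f"UPDATE leads SET parcel_enriched = 1 WHERE id = {lead['id']};")
```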
Stage 5: Business intelligence
Google Places returns more than a name and address. The same API call that discovered the lead can populate review count, average rating, business status, and category. You use these to prioritize outreach.
```python
def enrich_business_intel(lead, api_key):
    """Pull business intelligence from Google Places for a lead."""
    query = f"{lead['company']} {lead['address']}"
    results = places_search(query, api_key, max_results=1)
    if not results:
        return {}
    place = results[0]
    return {
        'rating': place.get('rating'),
        'review_count': place.get('userRatingCount', 0),
        'business_status': place.get('businessStatus', 'UNKNOWN'),
        'place_id': place.get('id'),
    }
```
A business with 200 reviews is probably larger than one with 5. A business marked CLOSED_TEMPORARILY is not worth emailing. These signals come back in the same API response you already paid for.
AI-operated qualification
Two jobs that would take a human hours per batch take an AI agent minutes:
Lead scoring
The agent reviews enrichment data and assigns a priority tier. What counts as "hot" depends on your business — here's the signal table:
| Signal | Where it comes from | What it indicates |
|---|---|---|
| Review count | Google Places | Business size and foot traffic |
| Has email | Email scrape | Reachable via cold outreach |
| Owner-operator match | Parcel link | Decision-maker identified |
| Property value / acreage | Parcel data | Scale of opportunity |
| Business status | Google Places | Active vs. closed vs. temporarily shut |
The agent weighs these signals against your ICP and assigns a tier (hot, warm, cold). Present the results in a table and let the user adjust before writing to D1.
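For a sense of what that weighing looks like in code, here's a deliberately crude rubric. The weights and thresholds are illustrative, not part of the pipeline; the agent's judgment replaces them in practice:

```python
def score_lead(lead, parcel=None):
    """Assign a rough tier from enrichment signals. Illustrative weights."""
    if lead.get('business_status') == 'CLOSED_PERMANENTLY':
        return 'cold'  # not worth emailing regardless of other signals
    points = 0
    if lead.get('email'):
        points += 2  # reachable via cold outreach
    if (lead.get('review_count') or 0) >= 50:
        points += 1  # established, visible business
    if lead.get('lead_type') == 'owner_operator':
        points += 2  # decision-maker identified
    if parcel and (parcel.get('market_value') or 0) >= 1_000_000:
        points += 1  # scale of opportunity
    return 'hot' if points >= 4 else 'warm' if points >= 2 else 'cold'
```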
Owner-operator detection
When a lead's business name resembles the parcel owner's name, they're likely an owner-operator — they own the building and make decisions about it directly. If you sell anything property-related, this is the person you want to talk to.
```python
STOP_WORDS = {'INC', 'LLC', 'LP', 'LTD', 'CORP', 'CO', 'THE', 'OF', 'AND', '&'}

def fuzzy_owner_match(business_name, owner_name):
    """Check if a business name is a fuzzy match for a parcel owner name."""
    def tokenize(s):
        tokens = re.sub(r'[^A-Z0-9 ]', '', s.upper()).split()
        return {t for t in tokens if t not in STOP_WORDS and len(t) > 1}
    biz_tokens = tokenize(business_name)
    own_tokens = tokenize(owner_name)
    if not biz_tokens or not own_tokens:
        return False
    overlap = biz_tokens & own_tokens
    return len(overlap) >= 2 or (len(overlap) == 1 and len(biz_tokens) <= 2)
```
Churches, schools, and fraternal organizations almost always own and operate their facility. The agent can classify these automatically without fuzzy matching.
Lead type taxonomy
Every lead gets a type based on the relationship between the business and the property. This determines who you're actually talking to when you reach out:
| Type | Who they are | What it means for outreach |
|---|---|---|
| owner_operator | Owns the building, runs the business | Direct decision-maker |
| tenant | Rents space, doesn't control the property | May need to route you to their landlord |
| property_manager | Manages properties for owners | Controls vendor selection for multiple sites |
The pipeline assigns owner_operator when the business name matches the parcel owner, and tenant otherwise. The agent can reclassify during qualification if it finds evidence one way or the other.
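A minimal classification sketch building on fuzzy_owner_match; the property-manager keyword check is a hypothetical heuristic, not something the pipeline ships with:

```python
def classify_lead_type(lead, parcel):
    """Assign a lead type from the business/owner relationship. Illustrative."""
    if parcel and fuzzy_owner_match(lead.get('company', ''),
                                    parcel.get('owner_name', '')):
        return 'owner_operator'
    # Hypothetical keyword check for management companies
    if 'PROPERTY MANAGEMENT' in (lead.get('company') or '').upper():
        return 'property_manager'
    return 'tenant'
```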
Wire it up as a skill
The whole pipeline is designed to be run by an AI coding agent. Here's the Claude Code skill file that orchestrates it:
```yaml
---
name: prospecting
description: Discover and enrich new prospects from public records.
  Use when user mentions "find prospects", "new leads", "prospecting",
  "parcel search", or "enrich contacts". Handles Places search,
  parcel-first discovery, dedup, and multi-stage enrichment.
---
```
Workflow
The skill body walks the agent through seven steps:
- Ask discovery source. Always start here — the user picks Places-first or Parcel-first.
- Analyze gaps. Query D1 for coverage by city and enrichment progress. Help the user pick targets where coverage is thin.
- Run discovery. Execute the chosen pipeline, deduplicate against existing D1 records.
- Review and insert. Show the user totals, city breakdown, dedup stats. Get explicit confirmation before writing to D1.
- Enrich. Run all five stages in order. Each stage is a separate script invocation — geocode, email scrape, Hunter.io, parcel link, business intel.
- Score and classify. The agent reviews enrichment data, assigns lead scores and types. Present a table for user confirmation before updating.
- Verify. Query enrichment progress, show before/after comparison.
Rules to encode in the skill
- Never insert duplicates. The dedup step is non-negotiable.
- Confirm before inserting. The agent must present discovery results and wait for user approval.
- Confirm before reclassifying. Score and type changes must be reviewed.
- Set flags regardless of outcome. The idempotent flag pattern is the backbone of the pipeline.
The pattern: The skill file defines workflow and rules. The scripts do the work. The agent orchestrates — it asks the user for decisions at each gate, runs the right script, and presents results. You approve, it executes.
What's next
- Cold email sequences. Build template-per-lead-type outreach and wire up a sending service (Resend, SendGrid, SES). The pipeline already classifies leads by type — map each type to a template.
- Scheduled enrichment. Run enrichment nightly as a cron job. New leads from today's discovery get geocoded and email-scraped by morning.
- Multi-county expansion. Each county is a new shapefile download and a filter config. The pipeline scales linearly with no code changes.
- Additional enrichment stages. LinkedIn for decision-maker names. Yelp for review data. Clearbit for company firmographics. Each one is a new flag and a new script — the pattern stays the same.
