Batch Investor Transcript Mining — All 63 Buyer Targets

Source: deal-docs/04-prompts/batch-transcript-mining-all-63.md

YOUR TASK

Run investor transcript mining for all 63 buyer targets in the buyer_dossiers table. For each company, search for earnings call transcripts, investor presentations, and public filings. Extract quotes, financial data, strategic signals, challenges language, CEO vision, and M&A appetite. Synthesize "golden nugget" quotes for cold-call openers. Update each company's existing row in Supabase. Save a local JSON per company. Run autonomously through all 63 targets — do NOT stop to ask questions.

WORKING DIRECTORY

~/Projects/dossier-pipeline/

ARCHITECTURE DECISIONS (LEARNED FROM SAP PILOT)

These are NOT suggestions — these are fixes from bugs encountered during the SAP run:

1. Supabase Column Types — CRITICAL

Two columns are Postgres text[] arrays, NOT JSONB:
- sec_risk_factors_relevant → must be Postgres array literal: '{"item 1","item 2"}'
- vision_strategic_priorities → must be Postgres array literal: '{"item 1","item 2"}'

All other JSON-looking columns (certified_key_quotes, quotes, challenges_earnings_language, sec_filing_raw, vision_raw, challenges_raw) are text columns that store JSON strings. Use json.dumps() for these.

Implementation pattern:

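import json

# For text[] columns: Postgres array literal
def to_pg_array(items):
    """Convert a Python list of strings to a Postgres array literal string."""
    if not items:
        return '{}'
    # Escape backslashes first, then double quotes, so each quoted element parses correctly.
    escaped = [str(item).replace('\\', '\\\\').replace('"', '\\"') for item in items]
    return '{' + ','.join(f'"{e}"' for e in escaped) + '}'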

patch['sec_risk_factors_relevant'] = to_pg_array(["risk 1", "risk 2"])
patch['vision_strategic_priorities'] = to_pg_array(["priority 1", "priority 2"])

# For text columns that store JSON: serialize with json.dumps()
patch['certified_key_quotes'] = json.dumps(quotes_list)
patch['quotes'] = json.dumps(quotes_list)

2. Three Tiers of Companies — Different Search Strategies

Not all 63 companies have earnings calls. Adapt the search strategy per buyer_type:

PUBLIC companies (44): Have quarterly earnings calls. Search for transcripts on SeekingAlpha, Motley Fool, Nasdaq, TipRanks, Yahoo Finance. Also search for investor day presentations and annual reports/10-K/20-F filings.

PE_BACKED firms (12): No earnings calls. Instead search for: portfolio company announcements, fund letters to LPs, conference presentations, podcast interviews with partners, press releases about investments in HR tech, community, or data companies. Search broader: TechCrunch, Business Insider, PE Hub, Pitchbook news.

PRIVATE companies (7): No earnings calls. Search for: CEO interviews, podcast transcripts, conference keynotes, blog posts from leadership, press releases, funding announcements. These often have the most candid strategic statements.
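
One way to encode the three tiers is a lookup table keyed by buyer_type. The sketch below is illustrative only: the source lists and topics are copied from the paragraphs above, but SEARCH_STRATEGIES and build_queries are hypothetical names, and the query wording is an assumption rather than something the SAP pilot used.

```python
# Hypothetical lookup table; topics and sites are taken from the tier descriptions above.
SEARCH_STRATEGIES = {
    "PUBLIC": {
        "topics": [
            "earnings call transcript",
            "investor day presentation",
            "annual report 10-K 20-F",
        ],
        "sites": ["SeekingAlpha", "Motley Fool", "Nasdaq", "TipRanks", "Yahoo Finance"],
    },
    "PE_BACKED": {
        "topics": [
            "portfolio company announcement",
            "fund letter to LPs",
            "conference presentation",
            "partner podcast interview",
            "HR tech investment press release",
        ],
        "sites": ["TechCrunch", "Business Insider", "PE Hub", "Pitchbook news"],
    },
    "PRIVATE": {
        "topics": [
            "CEO interview",
            "podcast transcript",
            "conference keynote",
            "leadership blog post",
            "press release",
            "funding announcement",
        ],
        "sites": [],  # no specific sites are named for this tier
    },
}


def build_queries(company_name, buyer_type):
    """Return one search query per topic for the given tier."""
    try:
        strategy = SEARCH_STRATEGIES[buyer_type]
    except KeyError:
        raise ValueError(f"unknown buyer_type: {buyer_type!r}")
    return [f"{company_name} {topic}" for topic in strategy["topics"]]
```

For example, build_queries("Workday", "PUBLIC") yields three queries, starting with "Workday earnings call transcript".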

3. Company-Specific Search Tuning

Some companies need special search handling:
- SAP SuccessFactors → already done, SKIP
- Oracle HCM → search "Oracle" not "Oracle HCM" for earnings
- Ceridian (Dayforce) → search both "Ceridian" and "Dayforce" (rebranded)
- UKG (Ultimate Kronos) → search "UKG" and "Ultimate Kronos Group"
- Amazon (AWS) → search "Amazon" earnings, focus on AWS segments
- RELX (Reed Elsevier) → search both names
- Manpower (ManpowerGroup) → search "ManpowerGroup"
- Dotdash Meredith → search "Dotdash Meredith" and "IAC" (parent)
- Industry Dive → was acquired by Informa, search both
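
The overrides above could live in a small alias table. This is a minimal sketch: SEARCH_ALIASES, SKIP_COMPANIES, and search_names are hypothetical names, and it assumes the company name stored in buyer_dossiers matches the labels used in this document exactly, which should be checked against the real rows.

```python
# Hypothetical alias table; keys assume the row's company name matches the labels in this document.
SEARCH_ALIASES = {
    "Oracle HCM": ["Oracle"],
    "Ceridian (Dayforce)": ["Ceridian", "Dayforce"],
    "UKG (Ultimate Kronos)": ["UKG", "Ultimate Kronos Group"],
    "Amazon (AWS)": ["Amazon"],
    "RELX (Reed Elsevier)": ["RELX", "Reed Elsevier"],
    "Manpower (ManpowerGroup)": ["ManpowerGroup"],
    "Dotdash Meredith": ["Dotdash Meredith", "IAC"],
    "Industry Dive": ["Industry Dive", "Informa"],
}

SKIP_COMPANIES = {"SAP SuccessFactors"}  # already mined in the pilot


def search_names(company_name):
    """Return the names to search for, or an empty list if the company is skipped."""
    if company_name in SKIP_COMPANIES:
        return []
    # Companies without an override are searched under their own name.
    return SEARCH_ALIASES.get(company_name, [company_name])
```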

4. Rate Limit Handling

Mistral rate-limits at 30 RPM. During the SAP run, we hit a 429 on the 6th call and auto-fell back to DeepSeek.
- Add 2-second delay between Mistral calls (not just the 0.3s Exa delay)
- The fallback chain in lib/llm_client.py handles this automatically — Mistral → DeepSeek
- Budget: Exa ~$0.012/quarter/company, Mistral ~$0.002/call. Full run ≈ $2-4 total.
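
A 2-second spacing between Mistral calls works out to at most 30 calls per minute, which matches the stated limit. A minimal throttle sketch follows; MinIntervalThrottle is a hypothetical helper, not something that exists in lib/llm_client.py, and the injectable clock and sleep arguments are only there to make it testable.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls are at least min_interval_s apart."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call mistral_throttle.wait() immediately before each Mistral request.
mistral_throttle = MinIntervalThrottle(2.0)
```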

5. Exa Search — Use max_chars=8000

The default 2000 chars was too short for transcript snippets. Use 8000 for transcript searches to capture full quote context.

6. Truncation Safety

Mistral's context is 30K chars. When combining multiple Exa results, cap the combined text at 25K chars to leave room for the extraction prompt.
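
The 25K cap can be enforced while joining results, so that earlier (presumably higher-ranked) results are kept whole and only the tail is cut. This is a sketch under that assumption; combine_results and the separator string are hypothetical, not part of lib/exa_client.py.

```python
MAX_COMBINED_CHARS = 25_000  # leaves roughly 5K chars for the extraction prompt


def combine_results(texts, max_chars=MAX_COMBINED_CHARS, separator="\n\n---\n\n"):
    """Join result texts in order, truncating so the total never exceeds max_chars."""
    parts = []
    used = 0
    for text in texts:
        if not text:
            continue
        sep = separator if parts else ""
        room = max_chars - used - len(sep)
        if room <= 0:
            break  # budget exhausted; drop the remaining results
        chunk = text[:room]
        parts.append(sep + chunk)
        used += len(sep) + len(chunk)
    return "".join(parts)
```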

HOW TO EXECUTE

Step 1: Create the script

Write scripts/batch_transcript_miner.py that:

1. Reads all 63 rows from the buyer_dossiers table
2. Skips any row where quote_count > 0 (already mined; currently only SAP)
3. For each remaining company:
   a. Determine search strategy based on buyer_type
   b. Run 2-3 Exa searches per "period" (8 quarters for PUBLIC, 6 searches for PE/PRIVATE)
   c. Extract with Mistral (researcher_1); use generate_json() for clean parsing
   d. After all quarters extracted, synthesize golden nuggets with Claude CLI (synthesizer)
   e. PATCH the Supabase row with all extracted data
   f. Save local JSON to data/buyer_dossiers/{slug}_transcripts.json
   g. Log cost to dossier_cost_log
   h. Print progress: [14/63] ServiceNow — 23 quotes, 5 golden nuggets, $0.18

Step 2: Supabase PATCH — Column Mapping

For each company, update these columns in the existing row:

patch = {
    # Text columns (JSON strings)
    "certified_key_quotes": json.dumps(certified_quotes_list),
    "challenges_earnings_language": json.dumps(challenges_list),
    "vision_ceo_direction": ceo_vision_string,         # plain text
    "sec_filing_raw": json.dumps(sec_data_dict),
    "quotes": json.dumps(simple_quotes_list),
    "quote_count": len(all_quotes),                    # integer
    "certified_ceo_vision": ceo_vision_string,         # plain text
    "vision_raw": json.dumps(vision_raw_dict),
    "challenges_raw": json.dumps(challenges_raw_dict),
    "challenges_public": challenges_summary_string,    # plain text
    "sec_revenue": revenue_string,                     # plain text
    "sec_revenue_growth": growth_string,               # plain text
    "sec_ma_intent": ma_intent_string,                 # plain text

    # Postgres text[] arrays — use to_pg_array()
    "sec_risk_factors_relevant": to_pg_array(risk_factors),
    "vision_strategic_priorities": to_pg_array(strategic_priorities),
}

IMPORTANT: Use requests.patch() with the Supabase REST API. Do NOT use requests.post() or create new rows.

url = f"{SUPABASE_URL}/buyer_dossiers?id=eq.{row_id}"
resp = requests.patch(url, headers=SUPABASE_HEADERS, json=patch, timeout=30)
resp.raise_for_status()  # surface 4xx/5xx so the per-company try/except can log the failure

Step 3: Extraction Prompt — Adapt for Company Type

For PUBLIC companies with earnings calls:

Extract ALL content relevant to: HR technology, talent management, workforce planning,
AI in HR, community/engagement platforms, content/media strategy, practitioner networks,
data/analytics, and M&A activity.

Focus topics: {company_name}'s relationship to HR, workforce, talent, learning,
community, content, data, AI, and any acquisitions in adjacent spaces.

For PE_BACKED firms:

Extract ALL content relevant to: investment thesis for HR technology, community platforms,
content/media businesses, workforce/talent companies. Look for: portfolio company
investments in HR/community/data space, partner quotes about thesis areas, stated
investment criteria, and any mentions of community, engagement, content platforms.

For PRIVATE companies:

Extract ALL content relevant to: strategic direction for HR/workforce/community,
CEO vision statements, product roadmap signals, competitive positioning,
partnership/acquisition signals, and community/engagement strategy.

Step 4: Golden Nugget Synthesis Prompt

For each company, after extraction, run this through Claude CLI (synthesizer):

You are a cold-call strategist for Next Chapter M&A Advisory. We represent HR.com —
a company with 2M+ HR practitioners, community engagement data, content, and events.

Below are quotes from {company_name}'s public statements. Identify the 5 best
"golden nugget" quotes for a cold call to {company_name}'s corp dev team.
These should demonstrate their need for what HR.com offers.

Return JSON with: golden_nuggets (array), ceo_vision_summary, challenges_summary,
ma_appetite_summary
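
CLI output may wrap the JSON in extra text, so it is worth validating the synthesizer response before using it. The sketch below is one possible check; parse_synthesis is a hypothetical helper, and the assumption that the response contains exactly one top-level JSON object is not guaranteed by the prompt above.

```python
import json

REQUIRED_KEYS = (
    "golden_nuggets",
    "ceo_vision_summary",
    "challenges_summary",
    "ma_appetite_summary",
)


def parse_synthesis(raw):
    """Extract and validate the JSON object from the synthesizer's raw output."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in synthesizer output")
    data = json.loads(raw[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"synthesizer output missing keys: {missing}")
    if not isinstance(data["golden_nuggets"], list):
        raise ValueError("golden_nuggets must be an array")
    return data
```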

Step 5: Error Resilience — Run All Night

The script MUST be resilient:
- Try/except around each company — if one fails, log the error and continue to the next
- Save progress after each company — write a progress file so we can resume
- Checkpoint file: data/transcript_mining_progress.json with {slug: "done", slug2: "failed:reason"}
- Resume mode: If the progress file exists, skip companies already marked "done"
- Retry failed companies at the end with a second pass
- Graceful shutdown: Catch KeyboardInterrupt, save progress, exit cleanly
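
The resilience requirements above can be sketched as a small checkpointed loop. Everything here is illustrative: load_progress, save_progress, and run_all are hypothetical names, process stands in for the per-company pipeline, and the write-to-temp-then-rename step is an added assumption so that an interrupted write cannot corrupt the progress file.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return the saved {slug: status} map, or an empty dict if no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_progress(path, progress):
    """Write the checkpoint atomically: temp file first, then rename over the target."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(progress, f, indent=2)
    os.replace(tmp_path, path)


def run_all(slugs, process, path, resume=True):
    """Process every slug, checkpointing after each one, then retry failures once."""
    progress = load_progress(path) if resume else {}

    def attempt(slug):
        try:
            process(slug)
            progress[slug] = "done"
        except Exception as exc:  # one bad company must not stop the run
            progress[slug] = f"failed:{exc}"
        save_progress(path, progress)

    try:
        for slug in slugs:
            if progress.get(slug) != "done":
                attempt(slug)
        for slug in slugs:  # second pass over companies that failed
            if str(progress.get(slug, "")).startswith("failed:"):
                attempt(slug)
    except KeyboardInterrupt:
        save_progress(path, progress)
        print("Interrupted; progress saved.")
    return progress
```

KeyboardInterrupt is not a subclass of Exception, so it passes through the per-company handler and reaches the outer one, where progress is saved before returning.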

Step 6: Cost Tracking

Log to dossier_cost_log after each company:

{
    "company_slug": slug,
    "stage": "transcript_mining",
    "cost_usd": cost,
    "details": json.dumps({...})
}

EXISTING CODE TO USE

All code is in ~/Projects/dossier-pipeline/:

- config.py: all API keys, Supabase config, LLM provider registry, role assignments
- lib/exa_client.py: ExaClient with .search(), .search_multi(), .format_research()
- lib/llm_client.py: LLMClient with .generate(), .generate_json(), and get_client(role)
  - get_client("researcher_1") → Mistral (extraction, scored 96/100)
  - get_client("researcher_2") → DeepSeek (cross-check, scored 95.3/100)
  - get_client("synthesizer") → Claude CLI (narrative, best quality, FREE)
- scripts/sap_transcript_miner.py: the SAP-specific miner (reference implementation)

DO NOT:
- Create new API accounts or Supabase projects
- Use the Anthropic API — use claude -p CLI via the synthesizer role
- Create new tables — only UPDATE existing rows in buyer_dossiers
- Store data outside ~/Projects/dossier-pipeline/data/ or Supabase
- Skip cost logging

RUNNING IT

cd ~/Projects/dossier-pipeline
python3 scripts/batch_transcript_miner.py          # Full run, all 63
python3 scripts/batch_transcript_miner.py --resume  # Resume from checkpoint
python3 scripts/batch_transcript_miner.py --limit 5 # Test with 5 companies
python3 scripts/batch_transcript_miner.py --company "ServiceNow"  # Single company
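
The flags above map directly onto argparse. A minimal sketch, assuming the three flags are independent and optional; build_parser is a hypothetical name and the help strings are paraphrased from the comments above.

```python
import argparse


def build_parser():
    """Command-line interface for the batch miner (flags as listed above)."""
    parser = argparse.ArgumentParser(
        description="Batch investor transcript mining for buyer targets"
    )
    parser.add_argument("--resume", action="store_true",
                        help="resume from the checkpoint file")
    parser.add_argument("--limit", type=int, default=None,
                        help="process at most this many companies (test runs)")
    parser.add_argument("--company", default=None,
                        help="process a single company by name")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```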

EXPECTED OUTPUT

After completion:
- All 63 rows in buyer_dossiers populated with transcript data (62 updated by this run; SAP was already done)
- 62 JSON files in data/buyer_dossiers/ (SAP already exists at data/sap_investor_transcripts.json)
- Cost logged for each company in dossier_cost_log
- Console output showing progress per company
- Final summary: companies processed, total quotes, total cost, any failures

QUALITY GATE

Before marking a company "done":
1. quote_count must be > 0 (at least 1 quote found)
2. vision_ceo_direction must be non-empty
3. golden_nuggets must have at least 1 entry
4. If all three fail (no quotes found at all), mark as "needs_manual_review" in the progress file — don't write empty data to Supabase

For PE firms and private companies, the bar is lower — even 1-2 quotes from interviews or press releases count. The golden nugget synthesis can work with less data by focusing on investment thesis alignment rather than earnings call quotes.
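
The gate can be expressed as a single function that returns the status to record in the progress file. This is a sketch: quality_gate is a hypothetical name, and because the rules above only cover the all-pass and all-fail cases, the "failed:quality_gate:..." status for a partial failure is an assumption, not part of the spec.

```python
def quality_gate(quote_count, vision_ceo_direction, golden_nuggets):
    """Return the progress-file status for one company."""
    checks = {
        "quote_count": quote_count > 0,
        "vision_ceo_direction": bool(vision_ceo_direction and vision_ceo_direction.strip()),
        "golden_nuggets": len(golden_nuggets or []) >= 1,
    }
    if all(checks.values()):
        return "done"
    if not any(checks.values()):
        # Nothing usable was found: do not write empty data to Supabase.
        return "needs_manual_review"
    # Assumption: a partial failure is recorded as failed, naming the checks that did not pass.
    failed = [name for name, ok in checks.items() if not ok]
    return "failed:quality_gate:" + ",".join(failed)
```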

THE 63 TARGETS

PUBLIC (44) — Mine earnings calls Q1 2023 to Q4 2025:

Accenture, ADP, Alight, Amazon (AWS), Anthropic, BambooHR, Ceridian (Dayforce), Cohere, Culture Amp, Deel, Deloitte, Dotdash Meredith, EY, G2, Gainsight, Google, Greenhouse, HubSpot, IBM, Indeed, Industry Dive, Informa, Lattice, Manpower (ManpowerGroup), Microsoft, OpenAI, Oracle HCM, Paychex, Paycom, Paylocity, PWC, RELX (Reed Elsevier), Remote, Salesforce, SAP SuccessFactors (SKIP — already done), ServiceNow, Shopify, TechTarget, Thomson Reuters, UKG (Ultimate Kronos), Wiley, Wolters Kluwer, Workday, ZoomInfo

PE_BACKED (12) — Mine partner interviews, fund letters, portfolio announcements:

Benchmark, Clearlake Capital, Francisco Partners, Hellman & Friedman, Insight Partners, KKR, Norwest Venture Partners, Silver Lake, Spark Capital, Thoma Bravo, Thrive Capital, Vista Equity Partners

PRIVATE (7) — Mine CEO interviews, press releases, conference talks:

ATD, Bevy, Khoros, Mighty Networks, Rippling, SHRM, WorldatWork

IMPORTANT NOTES

- SAP is already done: skip the row with slug sap-successfactors.
- Some "PUBLIC" companies are actually private (Anthropic, Cohere, OpenAI, BambooHR, Culture Amp, Deel, Greenhouse, G2, Gainsight, Indeed, Lattice, Remote). These don't have earnings calls of their own. Treat them like PRIVATE: search for CEO interviews, funding announcements, conference talks.
- The Big 4 consulting firms on the list (Deloitte, EY, PWC) are partnerships, not public companies. No earnings calls. Search for: partner interviews, thought leadership, industry reports, HR tech practice announcements.
- Holding companies: for Industry Dive (owned by Informa) and Indeed (owned by Recruit Holdings), search the parent company's earnings for mentions of the subsidiary.
- AI companies (Anthropic, Cohere, OpenAI): search for CEO talks at conferences, blog posts, partnership announcements, product launches related to enterprise/HR use cases.
- Pipe prompts to Claude CLI via stdin (input=prompt), NOT as CLI arguments. Large prompts (15K+ chars) can exceed the shell's argument-length limit.
- Commit and push after the run completes: git add data/ scripts/ && git commit -m "batch transcript mining: 63 targets" && git push
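
For the stdin note above, a minimal sketch of passing a prompt through subprocess rather than argv. run_cli_with_stdin is a hypothetical helper (the synthesizer role in lib/llm_client.py may already do this), and it assumes claude -p reads the prompt from stdin when none is given as an argument, as this document states.

```python
import subprocess


def run_cli_with_stdin(prompt, command=("claude", "-p"), timeout=600):
    """Run a CLI with the prompt on stdin and return its stdout."""
    result = subprocess.run(
        list(command),
        input=prompt,          # sent on stdin, so prompt size is not bound by argv limits
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

The timeout of 600 seconds is an arbitrary placeholder; tune it to the longest synthesis observed in the SAP run.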