Run investor transcript mining for all 63 buyer targets in the buyer_dossiers table. For each company, search for earnings call transcripts, investor presentations, and public filings. Extract quotes, financial data, strategic signals, challenges language, CEO vision, and M&A appetite. Synthesize "golden nugget" quotes for cold-call openers. Update each company's existing row in Supabase. Save a local JSON per company. Run autonomously through all 63 targets — do NOT stop to ask questions.
Working directory: ~/Projects/dossier-pipeline/
These are NOT suggestions. They are fixes for bugs encountered during the SAP run:
Two columns are Postgres text[] arrays, NOT JSONB:
- sec_risk_factors_relevant → must be Postgres array literal: '{"item 1","item 2"}'
- vision_strategic_priorities → must be Postgres array literal: '{"item 1","item 2"}'
All other JSON-looking columns (certified_key_quotes, quotes, challenges_earnings_language, sec_filing_raw, vision_raw, challenges_raw) are text columns that store JSON strings. Use json.dumps() for these.
Implementation pattern:
# For text[] columns: Postgres array literal
def to_pg_array(items):
    """Convert Python list to Postgres array literal string."""
    if not items:
        return '{}'
    # Escape backslashes first, then double quotes (both need a backslash inside a quoted element)
    escaped = [item.replace('\\', '\\\\').replace('"', '\\"') for item in items]
    return '{' + ','.join(f'"{e}"' for e in escaped) + '}'

patch['sec_risk_factors_relevant'] = to_pg_array(["risk 1", "risk 2"])
patch['vision_strategic_priorities'] = to_pg_array(["priority 1", "priority 2"])
# For text columns that store JSON: JSON string
patch['certified_key_quotes'] = json.dumps(quotes_list)
patch['quotes'] = json.dumps(quotes_list)
Not all 63 companies have earnings calls. Adapt the search strategy per buyer_type:
PUBLIC companies (44): Have quarterly earnings calls. Search for transcripts on SeekingAlpha, Motley Fool, Nasdaq, TipRanks, Yahoo Finance. Also search for investor day presentations and annual reports/10-K/20-F filings.
PE_BACKED firms (12): No earnings calls. Instead search for: portfolio company announcements, fund letters to LPs, conference presentations, podcast interviews with partners, press releases about investments in HR tech, community, or data companies. Search broader: TechCrunch, Business Insider, PE Hub, Pitchbook news.
PRIVATE companies (7): No earnings calls. Search for: CEO interviews, podcast transcripts, conference keynotes, blog posts from leadership, press releases, funding announcements. These often have the most candid strategic statements.
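One way to encode the per-type strategy above is a lookup table keyed by buyer_type. This is a minimal sketch: the names SEARCH_PLANS and build_queries, the exact query wording, and the assumption that buyer_type is stored as the strings PUBLIC, PE_BACKED, and PRIVATE are all illustrative and not taken from the existing pipeline.

```python
# Hypothetical lookup table: buyer_type -> search terms and preferred sources.
# The terms and sources are copied from the strategy notes; the structure is an assumption.
SEARCH_PLANS = {
    "PUBLIC": {
        "terms": ["earnings call transcript", "investor day presentation",
                  "annual report 10-K 20-F"],
        "sources": ["SeekingAlpha", "Motley Fool", "Nasdaq", "TipRanks", "Yahoo Finance"],
    },
    "PE_BACKED": {
        "terms": ["portfolio company announcement", "letter to LPs",
                  "conference presentation", "partner podcast interview",
                  "investment HR tech community data press release"],
        "sources": ["TechCrunch", "Business Insider", "PE Hub", "Pitchbook news"],
    },
    "PRIVATE": {
        "terms": ["CEO interview", "podcast transcript", "conference keynote",
                  "leadership blog post", "press release", "funding announcement"],
        "sources": [],
    },
}


def build_queries(company_name, buyer_type):
    """Return one query string per search term for the given buyer type."""
    plan = SEARCH_PLANS.get(buyer_type)
    if plan is None:
        raise ValueError(f"Unknown buyer_type: {buyer_type!r}")
    return [f"{company_name} {term}" for term in plan["terms"]]
```

Keeping the strategy in data rather than in branching code makes it easier to adjust search terms per type without touching the mining loop.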
Some companies need special search handling:
- SAP SuccessFactors → already done, SKIP
- Oracle HCM → search "Oracle" not "Oracle HCM" for earnings
- Ceridian (Dayforce) → search both "Ceridian" and "Dayforce" (rebranded)
- UKG (Ultimate Kronos) → search "UKG" and "Ultimate Kronos Group"
- Amazon (AWS) → search "Amazon" earnings, focus on AWS segments
- RELX (Reed Elsevier) → search both names
- Manpower (ManpowerGroup) → search "ManpowerGroup"
- Dotdash Meredith → search "Dotdash Meredith" and "IAC" (parent)
- Industry Dive → was acquired by Informa, search both
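The special cases above could be captured in a small alias table so the mining loop stays generic. A sketch, assuming the company names in buyer_dossiers match the display names used in this brief; SEARCH_ALIASES, SKIP, and search_names are hypothetical names.

```python
# Search names per company, taken from the special-handling list.
# Keys are assumed to match the display names stored in buyer_dossiers.
SEARCH_ALIASES = {
    "Oracle HCM": ["Oracle"],
    "Ceridian (Dayforce)": ["Ceridian", "Dayforce"],
    "UKG (Ultimate Kronos)": ["UKG", "Ultimate Kronos Group"],
    "Amazon (AWS)": ["Amazon"],
    "RELX (Reed Elsevier)": ["RELX", "Reed Elsevier"],
    "Manpower (ManpowerGroup)": ["ManpowerGroup"],
    "Dotdash Meredith": ["Dotdash Meredith", "IAC"],
    "Industry Dive": ["Industry Dive", "Informa"],
}

# Companies that were already mined and must not be processed again.
SKIP = {"SAP SuccessFactors"}


def search_names(company_name):
    """Return the names to search for, or an empty list if the company is skipped."""
    if company_name in SKIP:
        return []
    return SEARCH_ALIASES.get(company_name, [company_name])
```

Companies without an entry fall back to their own name, so only the exceptions need to be listed.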
Mistral rate-limits at 30 RPM. During the SAP run, we hit a 429 on the 6th call and auto-fell back to DeepSeek.
- Add 2-second delay between Mistral calls (not just the 0.3s Exa delay)
- The fallback chain in lib/llm_client.py handles this automatically — Mistral → DeepSeek
- Budget: Exa ~$0.012/quarter/company, Mistral ~$0.002/call. Full run ≈ $2-4 total.
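The delays above can be enforced with a small throttle that sleeps only for the time still owed since the previous call. This is a sketch, not the pipeline's own code: MinIntervalThrottle is a hypothetical helper, and the injectable clock and sleep parameters exist only to make it testable.

```python
import time


class MinIntervalThrottle:
    """Block so that successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 30 requests per minute works out to one call every 2 seconds.
mistral_throttle = MinIntervalThrottle(2.0)
exa_throttle = MinIntervalThrottle(0.3)
```

Calling mistral_throttle.wait() immediately before each Mistral request keeps the spacing at 2 seconds. That sits right at the 30 RPM limit, so the DeepSeek fallback still matters if a 429 slips through.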
The default limit of 2000 chars per search result was too short for transcript snippets. Use 8000 for transcript searches to capture full quote context.
Mistral's context is 30K chars. When combining multiple Exa results, cap the combined text at 25K chars to leave room for the extraction prompt.
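The two limits above (8000 chars per result, 25K chars combined) can be applied in one pass when joining search results. A minimal sketch; combine_results and its separator are hypothetical, and it assumes each result has already been reduced to a plain text string.

```python
def combine_results(texts, per_result_cap=8000, total_cap=25000,
                    separator="\n\n---\n\n"):
    """Join result texts, trimming each one and keeping the total within total_cap chars."""
    parts = []
    used = 0  # characters consumed so far, separators included
    for text in texts:
        snippet = text[:per_result_cap]
        sep_len = len(separator) if parts else 0
        room = total_cap - used - sep_len
        if room <= 0:
            break  # no space left for another result
        snippet = snippet[:room]
        parts.append(snippet)
        used += sep_len + len(snippet)
    return separator.join(parts)
```

Truncating the last result instead of dropping it keeps as much context as possible while leaving the remaining 5K chars for the extraction prompt.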
Write scripts/batch_transcript_miner.py that:
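- Reads all rows from the buyer_dossiers table
- Skips rows with quote_count > 0 (already mined; currently only SAP)
- Adapts the search strategy per buyer_type
- Extracts with generate_json() for clean parsing
- Saves a local JSON to data/buyer_dossiers/{slug}_transcripts.json
- Logs cost to dossier_cost_log
- Prints progress per company, e.g. [14/63] ServiceNow — 23 quotes, 5 golden nuggets, $0.18
For each company, update these columns in the existing row: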
patch = {
    # Text columns (JSON strings)
    "certified_key_quotes": json.dumps(certified_quotes_list),
    "challenges_earnings_language": json.dumps(challenges_list),
    "vision_ceo_direction": ceo_vision_string,  # plain text
    "sec_filing_raw": json.dumps(sec_data_dict),
    "quotes": json.dumps(simple_quotes_list),
    "quote_count": len(all_quotes),  # integer
    "certified_ceo_vision": ceo_vision_string,  # plain text
    "vision_raw": json.dumps(vision_raw_dict),
    "challenges_raw": json.dumps(challenges_raw_dict),
    "challenges_public": challenges_summary_string,  # plain text
    "sec_revenue": revenue_string,  # plain text
    "sec_revenue_growth": growth_string,  # plain text
    "sec_ma_intent": ma_intent_string,  # plain text
    # Postgres text[] arrays: use to_pg_array()
    "sec_risk_factors_relevant": to_pg_array(risk_factors),
    "vision_strategic_priorities": to_pg_array(strategic_priorities),
}
IMPORTANT: Use requests.patch() with the Supabase REST API. Do NOT use requests.post() or create new rows.
url = f"{SUPABASE_URL}/buyer_dossiers?id=eq.{row_id}"
resp = requests.patch(url, headers=SUPABASE_HEADERS, json=patch, timeout=30)
For PUBLIC companies with earnings calls:
Extract ALL content relevant to: HR technology, talent management, workforce planning,
AI in HR, community/engagement platforms, content/media strategy, practitioner networks,
data/analytics, and M&A activity.
Focus topics: {company_name}'s relationship to HR, workforce, talent, learning,
community, content, data, AI, and any acquisitions in adjacent spaces.
For PE_BACKED firms:
Extract ALL content relevant to: investment thesis for HR technology, community platforms,
content/media businesses, workforce/talent companies. Look for: portfolio company
investments in HR/community/data space, partner quotes about thesis areas, stated
investment criteria, and any mentions of community, engagement, content platforms.
For PRIVATE companies:
Extract ALL content relevant to: strategic direction for HR/workforce/community,
CEO vision statements, product roadmap signals, competitive positioning,
partnership/acquisition signals, and community/engagement strategy.
For each company, after extraction, run this through Claude CLI (synthesizer):
You are a cold-call strategist for Next Chapter M&A Advisory. We represent HR.com —
a company with 2M+ HR practitioners, community engagement data, content, and events.
Below are quotes from {company_name}'s public statements. Identify the 5 best
"golden nugget" quotes for a cold call to {company_name}'s corp dev team.
These should demonstrate their need for what HR.com offers.
Return JSON with: golden_nuggets (array), ceo_vision_summary, challenges_summary,
ma_appetite_summary
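The synthesizer step can be sketched as two helpers: one that feeds the prompt to a CLI on stdin, and one that validates the JSON reply. This is an illustration under assumptions. In the pipeline, get_client("synthesizer") may already cover this. The helper names are hypothetical, and the command ["claude", "-p"] in the usage note assumes the CLI reads its prompt from stdin when none is passed as an argument.

```python
import json
import subprocess

# Keys the synthesizer prompt asks for.
REQUIRED_KEYS = {"golden_nuggets", "ceo_vision_summary",
                 "challenges_summary", "ma_appetite_summary"}


def run_cli_with_stdin(command, prompt, timeout=300):
    """Run command, feed prompt on stdin, and return its stdout as text."""
    result = subprocess.run(
        command,
        input=prompt,          # stdin avoids shell argument length limits
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


def parse_synthesis(raw):
    """Extract the JSON object from the reply and check the expected keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in synthesizer output")
    data = json.loads(raw[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("Synthesizer output is not a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data
```

A call would look like parse_synthesis(run_cli_with_stdin(["claude", "-p"], prompt)). Slicing from the first brace to the last tolerates a reply that wraps the JSON in extra text.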
The script MUST be resilient:
- Try/except around each company — if one fails, log the error and continue to the next
- Save progress after each company — write a progress file so we can resume
- Checkpoint file: data/transcript_mining_progress.json with {slug: "done", slug2: "failed:reason"}
- Resume mode: If the progress file exists, skip companies already marked "done"
- Retry failed companies at the end with a second pass
- Graceful shutdown: Catch KeyboardInterrupt, save progress, exit cleanly
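The resilience requirements above fit into one loop around a per-company function. The sketch below is an assumption about structure, not the reference implementation: mine_one(slug) is a hypothetical callable that raises on failure, and the temp-file write is an added safeguard against a half-written checkpoint.

```python
import json
import os

PROGRESS_PATH = "data/transcript_mining_progress.json"


def load_progress(path=PROGRESS_PATH):
    """Return the saved {slug: status} mapping, or an empty dict on a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_progress(progress, path=PROGRESS_PATH):
    """Write to a temp file, then rename, so an interrupted write cannot corrupt the file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(progress, fh, indent=2)
    os.replace(tmp, path)


def run_batch(slugs, mine_one, path=PROGRESS_PATH):
    """Process every slug, checkpoint after each, then retry failures once."""
    progress = load_progress(path)
    try:
        for pass_no in (1, 2):  # pass 2 retries companies that failed in pass 1
            for slug in slugs:
                status = progress.get(slug, "")
                if status == "done":
                    continue
                if pass_no == 2 and not status.startswith("failed:"):
                    continue
                try:
                    mine_one(slug)
                    progress[slug] = "done"
                except Exception as exc:  # one bad company must not stop the run
                    progress[slug] = f"failed:{exc}"
                save_progress(progress, path)
    except KeyboardInterrupt:
        save_progress(progress, path)  # graceful shutdown: keep what we have
    return progress
```

Because "done" entries are skipped on load, rerunning the same command after an interruption resumes where the previous run stopped.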
Log to dossier_cost_log after each company:
{
    "company_slug": slug,
    "stage": "transcript_mining",
    "cost_usd": cost,
    "details": json.dumps({...})
}
All code is in ~/Projects/dossier-pipeline/:
- config.py: all API keys, Supabase config, LLM provider registry, role assignments
- lib/exa_client.py: ExaClient with .search(), .search_multi(), .format_research()
- lib/llm_client.py: LLMClient with .generate(), .generate_json(), and get_client(role)
  - get_client("researcher_1") → Mistral (extraction, scored 96/100)
  - get_client("researcher_2") → DeepSeek (cross-check, scored 95.3/100)
  - get_client("synthesizer") → Claude CLI (narrative, best quality, FREE)
- scripts/sap_transcript_miner.py: the SAP-specific miner (reference implementation)
DO NOT:
- Create new API accounts or Supabase projects
- Use the Anthropic API — use claude -p CLI via the synthesizer role
- Create new tables — only UPDATE existing rows in buyer_dossiers
- Store data outside ~/Projects/dossier-pipeline/data/ or Supabase
- Skip cost logging
cd ~/Projects/dossier-pipeline
python3 scripts/batch_transcript_miner.py # Full run, all 63
python3 scripts/batch_transcript_miner.py --resume # Resume from checkpoint
python3 scripts/batch_transcript_miner.py --limit 5 # Test with 5 companies
python3 scripts/batch_transcript_miner.py --company "ServiceNow" # Single company
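The four invocations above imply three optional flags. A minimal argparse sketch of that interface; parse_args is a hypothetical helper, and the help texts paraphrase the comments on the commands.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags used by the batch miner."""
    parser = argparse.ArgumentParser(description="Batch investor transcript miner")
    parser.add_argument("--resume", action="store_true",
                        help="resume from the checkpoint file")
    parser.add_argument("--limit", type=int, default=None,
                        help="process only the first N companies (for testing)")
    parser.add_argument("--company", default=None,
                        help="process a single company by name")
    return parser.parse_args(argv)
```

With no flags, all three values stay at their defaults, which corresponds to the full run.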
After completion:
- All 63 rows in buyer_dossiers carry transcript data (62 updated by this run, plus SAP from the earlier run)
- 62 JSON files in data/buyer_dossiers/ (SAP already exists at data/sap_investor_transcripts.json)
- Cost logged for each company in dossier_cost_log
- Console output showing progress per company
- Final summary: companies processed, total quotes, total cost, any failures
Before marking a company "done":
1. quote_count must be > 0 (at least 1 quote found)
2. vision_ceo_direction must be non-empty
3. golden_nuggets must have at least 1 entry
4. If all three fail (no quotes found at all), mark as "needs_manual_review" in the progress file — don't write empty data to Supabase
For PE firms and private companies, the bar is lower — even 1-2 quotes from interviews or press releases count. The golden nugget synthesis can work with less data by focusing on investment thesis alignment rather than earnings call quotes.
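The completion checks above can be expressed as a single gate that returns the status to write into the progress file. A sketch with one labeled assumption: the brief does not say what to do when only some checks fail, so the "failed:incomplete" status below is a placeholder choice, and completion_status is a hypothetical name.

```python
def completion_status(quote_count, vision_ceo_direction, golden_nuggets):
    """Return the progress-file status for one company based on the three checks."""
    checks = [
        quote_count > 0,                                          # at least 1 quote
        bool(vision_ceo_direction and vision_ceo_direction.strip()),  # non-empty vision
        len(golden_nuggets) >= 1,                                 # at least 1 nugget
    ]
    if all(checks):
        return "done"
    if not any(checks):
        # Nothing usable was found: do not write empty data to Supabase.
        return "needs_manual_review"
    # Assumption: partial results are treated as a failure so the retry pass picks them up.
    return "failed:incomplete"
```

Only a "done" result should be followed by the Supabase PATCH. The other two statuses leave the existing row untouched.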
PUBLIC (44): Accenture, ADP, Alight, Amazon (AWS), Anthropic, BambooHR, Ceridian (Dayforce), Cohere, Culture Amp, Deel, Deloitte, Dotdash Meredith, EY, G2, Gainsight, Google, Greenhouse, HubSpot, IBM, Indeed, Industry Dive, Informa, Lattice, Manpower (ManpowerGroup), Microsoft, OpenAI, Oracle HCM, Paychex, Paycom, Paylocity, PWC, RELX (Reed Elsevier), Remote, Salesforce, SAP SuccessFactors (SKIP, already done), ServiceNow, Shopify, TechTarget, Thomson Reuters, UKG (Ultimate Kronos), Wiley, Wolters Kluwer, Workday, ZoomInfo
PE_BACKED (12): Benchmark, Clearlake Capital, Francisco Partners, Hellman & Friedman, Insight Partners, KKR, Norwest Venture Partners, Silver Lake, Spark Capital, Thoma Bravo, Thrive Capital, Vista Equity Partners
PRIVATE (7): ATD, Bevy, Khoros, Mighty Networks, Rippling, SHRM, WorldatWork
- Slug to skip: sap-successfactors
- Pass the prompt to the Claude CLI via stdin (input=prompt), NOT as CLI arguments. Large prompts (15K+ chars) will exceed shell argument limits.
- When the run finishes: git add data/ scripts/ && git commit -m "batch transcript mining: 63 targets" && git push