😺 🎙️ We Talked to the People Who Secretly Train the AI You Use Every Day


Welcome, humans.
You know the thumbs up / thumbs down button on ChatGPT? The one that asks if the response was good?
Turns out there are people who get paid to do that exact thing
…
except way more rigorously, for millions of hours, across every major AI lab.
Well, the company behind a huge chunk of that work just hit
$1.2B in revenue without ever raising a dime of VC money.
In our
latest podcast episode
, we sit down with Nick Heiner, VP of Product at
Surge AI
, to talk about the secret training grounds where AI models learn to actually do real work—and why even the best ones still fall apart almost half the time.

Here are some of our favorite parts:
(4:38)
What a "reinforcement learning environment" actually is—explained with a golf analogy even your boss would get.
(16:54)
Why the best AI models still fail ~40% of workplace tasks—and where the failures cluster.
(23:36)
200+ Wall Street experts graded GPT-5, Claude, and Gemini on real finance work. The models treated it like a college exam.
(31:24)
Reward hacking: how AI models game the system, like a kid who stops hitting their sister only to start kicking her instead.
(44:35)
Nick's bold prediction: a $1B company with one human employee by 2030.
(48:17)
Why AI writing all sounds the same—and the research to fix it.
Bottom line:
The models you use every day are only as good as the training environments behind them. Right now, those environments are the biggest bottleneck in AI—and the teams building them are quietly shaping what your AI can and can't do.
If you've ever wondered how AI models actually learn to do real work (not just answer trivia questions, but navigate messy spreadsheets, write actual reports, and handle angry customers) this is the episode to watch.
Nick breaks down the entire AI training stack in plain English: what pre-training, post-training, and reinforcement learning actually mean (with a golf analogy that'll stick with you), why the thumbs up/down button on ChatGPT is literally gathering training data, how Surge builds simulated companies to test whether AI can handle real jobs end-to-end, and why the quality of the "reward signal"—not the model itself—is the real bottleneck holding everything back.
Whether you're building with AI, investing in it, or just trying to understand why your chatbot still makes bizarre mistakes, this one fills in the gaps.
Watch and/or Listen now:
YouTube
|
Spotify
|
Apple Podcasts
P.S.
Surge just dropped
Riemann-bench
—a math benchmark built with Ivy League professors where every frontier model scores below 10%. For context, Surge built OpenAI's original GSM8K math benchmark. That one went from unsolvable to saturated in a few years.
If Riemann-bench follows the same path, the implications are way bigger than math scores…
Keep scrolling for…
details on our live episode with
Dan Shipper
(CEO of Every) on agent-native engineering tomorrow at 10 a.m. PT, Dan's must-watch interview with
Mike Krieger
(co-founder of Instagram, now at Anthropic) on building products in the agent era, four more recent episodes you might have missed, including Proton, Carta, NVIDIA, and SES AI, and a ton of resources from Surge AI you’ll love.
Real quick:
Want to see your AI-adjacent product or service show up right here, below these podcast promos? Click the button below to advertise to our 650K readers!

THIS EPISODE WAS MADE POSSIBLE BY OUR PARTNER…
Dell AI Factory with NVIDIA

When we talk about AI in the enterprise, there's this huge wave of optimism. 84% of business leaders say AI is going to transform their industry. That's massive.
But here's the reality:
93% are struggling to actually make it work.
That's the gap. And that's exactly what
Dell AI Factory with NVIDIA
is built to close.
Dell calls it the world's broadest AI portfolio, and that's not marketing fluff. We're talking everything from AI-ready PCs to servers, storage, networking, and services, all designed to work together.
But what really matters is this:
they've already helped implement over 3,000 real-world AI deployments. So this is proven, operational AI.
And they don't just drop hardware at your doorstep and wish you luck. Dell brings expert services at every stage—strategy, deployment, scaling—so you're not stuck in pilot mode wondering why nothing's moving.
If your organization believes AI is the future, but you're still trying to bridge that execution gap,
check out The Dell AI Factory with NVIDIA
.
Learn more at /YourWayToAI
.

🔴
LIVE THIS THURSDAY @ 10 AM PT | 1 PM ET: Dan Shipper, CEO of Every
Dan Shipper, CEO of
Every
, vibe coded an
agentic document editor
between meetings. It went viral. Then it went down.
Then it took over his entire week.
This Thursday,
he's joining us live
to break down what "
agent-native engineering
" actually looks like. That’s the framework his 15-person team uses to ship AI products at a pace most companies can't match, all with virtually zero hand-written code.

Click the link to go to YouTube, then click “Notify Me” to get notified when we go live
We'll also get into
Every's full product suite:
Spiral
(automatic style guides from your writing),
Sparkle
(AI file organization for Mac),
Cora
(AI email assistant, now on iOS),
Monologue
(voice dictation that writes the way you talk),
Proof
(the aforementioned agent-first document editor that broke the internet), and maybe we’ll even get to ask Dan a question about the brand-new tool launching this week:
Plus One
(
but they have their own livestream dedicated to that on Friday
).
And as usual, we’ll take questions from the crowd, so this is your chance to ask Dan “Every”-thing you ever wanted to know!
As Every says, they are “the only subscription you need to stay at the edge of AI” so you won’t want to miss this one!
👉
Join us live starting at 10 AM PT | 1 PM ET:
YouTube
|
LinkedIn
|
X
🎧 While you wait…
Dan just sat down with
Mike Krieger
(co-founder of Instagram, now VP of Product at Anthropic Labs) for a conversation that's essential viewing.
Mike had Claude rebuild Bourbon—Instagram's failed predecessor—in two hours, feature-complete with filters. They dig into why AI makes it dangerously easy to overbuild V1, how Anthropic's labs team kills features as aggressively as they ship them, and why the best product teams right now pair "founder-level conviction" people with senior systems engineers—not big teams.

More from The Neuron Podcast…
Your AI Chats Can Be Subpoenaed. His Can't.
— Proton's Eamonn Maguire on the privacy nightmare hiding in every AI chat.
YouTube
|
Spotify
|
Apple Podcasts
Solo Founders Are Taking Over (Carta's Data Proves It)
— Carta's CMO reveals what's really happening in the startup world.
YouTube
|
Spotify
|
Apple Podcasts
NVIDIA's Kari Briski Breaks Down Nemotron 3 (GTC 2026)
— Recorded live at GTC, the future of NVIDIA's open-source AI strategy.
YouTube
|
Spotify
|
Apple Podcasts
This AI Agent Compressed 8 Years of R&D Into 2 Weeks
— SES AI's CEO on AI agents transforming scientific discovery.
YouTube
|
Spotify
|
Apple Podcasts
And if you
haven’t subscribed yet, please do!
Click the image below to go to our channel and hit “subscribe” to get notified right when new videos go live.

We have a goal to hit 50K subscribers by the end of the year (if not 100K), and
we’re only 33K away!
If you like learning about AI, and already watch some of our videos,
do us a favor and click here to subscribe
today.
Dive deeper with these resources:
The Hierarchy of Agentic Capabilities (research paper)
— Surge's RL environment research showing the five core capabilities all agents need to master.
EnterpriseBench: CoreCraft
— Surge built a simulated startup with 2,500+ entities and 23 tools, then turned frontier models loose on real customer support tasks. Even GPT-5.2 at max reasoning only solved ~43%. Models hallucinated refunds, leaked PII, and got stuck in infinite logic loops.
Hemingway-bench AI Writing Leaderboard
— Nick's team built a writing benchmark graded by expert human writers instead of auto-graders. Turns out models that top other leaderboards often produce over-the-top prose where every sentence is a metaphor. Current leader? Gemini 3.1 Pro, with Opus 4.6 close behind.
Riemann-bench: Moonshot Mathematics
— Surge's newest benchmark, designed with Ivy League math professors and PhD IMO medalists. These are problems that took the authors themselves weeks to solve. Every frontier model scores below 10%. Surge originally built OpenAI's GSM8K math benchmark—this is the next frontier.
LMArena is a cancer on AI
— Nick's argument for why the most popular AI leaderboard is actually making models worse.
Nick's Sonnet 4.5 Product Review
— 100+ hours with the model, from Surge's product perspective.
Nick's Gemini 3.1 Review
— "Not leading edge, also in love with me."
Hilarious.
Nick's Substack
— Independent benchmarks, essays on the future of work, and dispatches from someone building AI products at Surge every day.
Surge AI Blog
— for more like this.
When Is It OK to Slop Your Colleagues?
— Nick's latest rule of thumb for the AI-assisted workplace:
"If you can't independently verify the quality of the content, don't send it to someone else without a disclaimer."
Required reading for anyone using AI at work!
Stay curious,
The Neuron Team

P.P.S:
Love the newsletter, but don’t want to receive these podcast announcement emails? Don’t unsubscribe —
adjust your preferences to opt out of them here instead
.

