We find what breaks
in GPU infrastructure,
and we fix it.
Caladrius is root-cause analysis and closed-loop remediation for GPU infrastructure. When a job stalls or slows, it traces the problem below the application layer, across device, fabric, storage, and workload, drives the fix on your approval, and verifies it held.
Caladrius builds root-cause analysis and closed-loop remediation for GPU infrastructure. When a training or inference job stalls or slows, the cause can sit in the GPU devices, the fabric, storage and checkpoints, or the workload itself, and resolving it today means an engineer correlating signals across all of those by hand. Caladrius does that work: it identifies the actual problem, drives the fix, and confirms it worked.
We have built this class of software before. The team founded ReleaseIQ, an SRE and ops-automation platform acquired by CloudBees, after earlier work at VMware and WebLogic, and runs SRE operations through Awan Infotech. GPU infrastructure is the same discipline pointed at a newer problem: the failures live below the application layer, where general tools see a black box.
Our team
Seetharam Param
Co-founder & CEO
Ex-CEO / co-founder, ReleaseIQ (acq. CloudBees). Ex-VMware, ex-WebLogic.
Sudhish Mangalasary
Head of Engineering
Ex-Director, CloudBees. Head of Eng, ReleaseIQ. Ex-Deputy GM, HCL America.
Chidambara Rajan
Co-founder, India Ops
ITOps / CloudOps. Founder & MD, Awan Infotech (SRE services).
Our advisors
Sandhya Sridharan
JPMorgan, ex-VMware
Scott Hammond
ex-Node.js Foundation, ex-CNCF, ex-Cisco
Siddanagouda Sankanagouda
Sr. Director, Intel; co-founder, ReleaseIQ
GPU infrastructure fails in ways general tooling can't see. A slowdown or failure in one layer can silently stall thousands of GPUs, with the root cause buried below the application layer, and resolving it today costs hours of hand-correlation. Our mission is to close that gap: a named root cause, a driven fix, and a verified outcome for every GPU incident.
Diagnose into the cluster
General tools stop at the application layer. Caladrius correlates device, fabric, storage, and workload signals to pinpoint what actually broke.
Close the loop
Diagnosis is half the job. Caladrius drives the fix on your approval and verifies it held, so incidents end in confirmed outcomes, not more alerts.
What the product stands on
Root cause, not symptoms
Restarts and reroutes keep a job limping. We go after the actual cause, so the same failure doesn't come back tomorrow.
No fix without proof
Every remediation runs with your approval and is verified against what the system actually did. If it didn't hold, you know.
Native to GPU infrastructure
Device, fabric, scheduler, training, and serving failure modes are the product's home ground, not an integration afterthought.