About Us

We find what breaks in GPU infrastructure, and we fix it.

Caladrius is root-cause analysis and closed-loop remediation for GPU infrastructure. When a job stalls or slows, it traces the problem below the application layer, across device, fabric, storage, and workload, drives the fix on your approval, and verifies it held.

WHO WE ARE

Caladrius builds root-cause analysis and closed-loop remediation for GPU infrastructure. When a training or inference job stalls or slows, the cause can sit in the GPU devices, the fabric, storage and checkpoints, or the workload itself, and resolving it today means an engineer correlating signals across all of those by hand. Caladrius does that work: it identifies the actual problem, drives the fix, and confirms it worked.

We have built this class of software before. The team founded ReleaseIQ, an SRE and ops-automation platform acquired by CloudBees, after earlier work at VMware and WebLogic, and runs SRE operations through Awan Infotech. GPU infrastructure is the same discipline pointed at a newer problem: the failures live below the application layer, where general tools see a black box.

Our team

Seetharam Param

Co-founder & CEO

Ex-CEO / co-founder, ReleaseIQ (acq. CloudBees). Ex-VMware, ex-WebLogic.

Sudhish Mangalasary

Head of Engineering

Ex-Director, CloudBees. Head of Eng, ReleaseIQ. Ex-Deputy GM, HCL America.

Chidambara Rajan

Co-founder, India Ops

ITOps / CloudOps. Founder & MD, Awan Infotech (SRE services).

Our Advisors

Sandhya Sridharan

JPMorgan, ex-VMware

Scott Hammond

ex-Node.js Foundation, ex-CNCF, ex-Cisco

Siddanagouda Sankanagouda

Sr. Director, Intel; co-founder ReleaseIQ

OUR MISSION

GPU infrastructure fails in ways general tooling can't see. A slowdown or failure in one layer can silently stall thousands of GPUs, with the root cause buried below the application layer, and resolving it today costs hours of hand-correlation. Our mission is to close that gap: a named root cause, a driven fix, and a verified outcome for every GPU incident.

Diagnose into the cluster

General tools stop at the application layer. Caladrius correlates device, fabric, storage, and workload signals to pinpoint what actually broke.

Close the loop

Diagnosis is half the job. Caladrius drives the fix on your approval and verifies it held, so incidents end in confirmed outcomes, not more alerts.

HOW WE BUILD

What the product stands on

Root cause, not symptoms

Restarts and reroutes keep a job limping. We go after the actual cause, so the same failure doesn't come back tomorrow.

No fix without proof

Every remediation runs with your approval and is verified against what the system actually did. If it didn't hold, you know.

Native to GPU infrastructure

Device, fabric, scheduler, training, and serving failure modes are the product's home ground, not an integration afterthought.