|
Gaurav Shinde
Hi, I'm Gaurav. I am a PhD student at the University of
Maryland, Baltimore County, advised by
Prof. Nirmalya Roy. My research focuses on embodied AI and collaborative perception for robotic systems.
Previously, I worked as a Research Assistant at MiCoSys
Lab with
Prof. Saptarshi Sengupta, where I developed deep learning techniques for
predicting the remaining useful life of lithium-ion
batteries.
LinkedIn
/
Google Scholar
/
GitHub
/
CV
Currently seeking a
summer internship
to apply my expertise in
robotics,
embodied AI, &
computer vision
to solve real-world problems.
|
|
Recent News
|
May 2026
|
Reviewer for WIREs DMKD and IEEE TMC.
|
|
Apr 2026
|
Organized NSF TRACE Robotic Testbed Workshop.
|
|
Apr 2026
|
VLS-RR accepted at IEEE/ACM CHASE 2026.
|
|
Mar 2026
|
Paper accepted at CVPR GRAIL-V Workshop.
|
|
Mar 2026
|
2 papers accepted at IEEE WoWMoM 2026.
|
|
Jan 2026
|
Inducted as a full member of the Sigma Xi Research Honor Society.
|
|
Dec 2025
|
Fed-CASQ paper accepted at IEEE PerCom 2026.
|
|
Sep 2025
|
Reviewer for Pattern Recognition Letters.
|
|
Sep 2025
|
Successfully cleared my PhD comprehensive exam.
|
|
Sep 2025
|
RespFormer paper accepted at ICMLA.
|
|
Aug 2025
|
TPC member for the International Conference on
COMmunication Systems & NETworkS (COMSNETS).
|
|
Aug 2025
|
Demonstrated remote robotics experimentation at
Graces Quarters under the ArtiAMAS project.
|
|
Jun 2025
|
Efficient VLM Survey
accepted at WIREs Data Mining and Knowledge
Discovery (DMKD).
|
|
Apr 2025
|
E2RespUNet
paper accepted at IEEE SmartComp.
|
|
Apr 2025
|
CoOpTex
paper accepted at IEEE DCOSS-IoT.
|
|
Mar 2025
|
Awarded NSF Travel Grant for IEEE PerCom.
|
|
Jan 2025
|
TAVIC-DAS
accepted at IEEE PerCom Workshop.
|
|
Aug 2024
|
Joined the University of Maryland, Baltimore County as
a PhD Researcher.
|
|
|
|
|
CALIBRA: Calibration-Aware Multi-Agent Verification for Contactless Physiological Monitoring
Shadman Sakib, Gaurav Shinde,
Nirmalya Roy
CVPR GRAIL-V, 2026
paper
Contactless physiological monitoring offers a promising
path toward low-burden health sensing, but reliable inference
from short video segments remains difficult when visual
evidence is weak, corrupted by motion, or spectrally
ambiguous. We present CALIBRA, a calibration-aware
multi-agent framework for contactless respiratory rate (RR)
estimation that treats prediction as an evidence-grounded
verification problem rather than a one-shot prompting task.
The framework operates on a structured artifact bundle
with lightweight signal-derived candidate rates, and performs
routing, specialized vision-language evidence analysis,
critic-based consistency checking, targeted repair, and
calibrated arbitration with optional abstention. On our in-house
benchmark, it consistently outperforms prompting-only
baselines across multiple backbones. With Gemini-2.5-Flash,
it reduces mean absolute error (MAE) from 5.21 to
3.42, corresponding to a 34.4% reduction over zero-shot
prompting and a 16.8% reduction over chain-of-thought
prompting. With GPT-4o, it reduces MAE from 5.48 to 3.78,
a 31.0% reduction over zero-shot prompting. It also improves
grounding and consistency in judge-based evaluation,
indicating better evidence alignment beyond numerical
accuracy alone. Under resource constraints, a routed
configuration maintains 93% coverage at 4.1 s average latency
and 1.6 GB peak memory. These results show that
robust contactless physiological inference depends not only
on stronger multimodal backbones, but also on explicit verification
of whether predictions are supported by the available
evidence.
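The zero-shot reductions quoted above follow directly from the reported MAE values. A minimal sketch of that arithmetic, where the helper name `relative_reduction` is an illustrative assumption and not part of CALIBRA:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# MAE values as reported in the abstract (zero-shot prompting vs. CALIBRA).
gemini = relative_reduction(5.21, 3.42)
gpt4o = relative_reduction(5.48, 3.78)

print(f"Gemini-2.5-Flash: {gemini:.1f}%")
print(f"GPT-4o: {gpt4o:.1f}%")
```

Rounded to one decimal place, these agree with the 34.4% and 31.0% figures stated in the abstract.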
|
|
|
Decoupling Perception and Reasoning for
Contactless Respiratory Rate with Vision Language
and Small Language Models
Shadman Sakib, Gaurav Shinde,
Nirmalya Roy
IEEE/ACM CHASE, 2026
paper
Contactless respiratory monitoring from video is a
promising alternative to wearable sensors but remains sensitive
to motion artifacts and variations in viewpoint and lighting,
which degrade the reliability of respiratory rate (RR) estimates.
Traditional signal processing methods often struggle to separate
physiological motion from visual artifacts, while large multimodal models (LMMs) can reason about these artifacts but are
often too computationally demanding for privacy-sensitive or
embedded deployments. This work aims to improve robustness
and deployability by explicitly decoupling visual perception from
downstream reasoning. We propose the VLS-RR (Visual–Language–SLM
Respiratory Rate) auditor, a three-stage framework where
video segments are converted into fused motion signals and
summary plots. A vision–language model (VLM) then produces a
textual description of rhythm, breath count, and artifacts. Finally,
small language models (SLMs) perform chain-of-thought (CoT)
auditing over this evidence and the signal-derived RR candidates.
We evaluate VLS-RR on a 50-video belt-synchronized dataset
spanning diverse breathing patterns, viewpoints, and lighting
conditions. Compared with signal-only baselines, VLS-RR reduces segment-level MAE by ≈ 30% and improves R² from 0.85
to 0.92. It also outperforms numeric-only SLM baselines with an
additional ≈ 25% MAE reduction. Finally, on an embedded edge
device, compact SLM auditors run with 0.82–1.71 s latency per
6 s window, indicating that decoupling perception and reasoning
enables accurate, resource-efficient RR estimation from video.
|
|
|
Fed-CASQ: Enhancing Class-Wise Accuracy in
Pervasive Federated Learning with Class-Aware
Scaling and Quantization
Emon Dey, Anuradha Ravi, Gaurav Shinde,
Garvit Chugh, Indrajeet Ghosh, Archan Misra, Nirmalya Roy
IEEE PerCom, 2026
paper
Federated Learning (FL) enables collaborative machine learning across decentralized devices and data sources,
but resource constraints on pervasive devices necessitate efficient
model compression. Existing approaches, such as quantization for
on-device training, often degrade accuracy, especially for classes
that are difficult to learn due to imbalance, poor-quality samples,
or inherent complexity. This results in persistent accuracy gaps
across classes. We propose Fed-CASQ, a novel framework that
couples class-aware strategies into the quantization process to
jointly improve efficiency and accuracy in pervasive FL. Unlike
prior works that address quantization and imbalance separately,
Fed-CASQ adaptively selects quantization levels based on device
resources and leverages Layer-wise Relevance Propagation (LRP)
to assess class-relevant convolutional neural network (CNN)
filters on the client side. An adaptive weight scaling mechanism
is then applied to amplify critical information for low-accuracy
classes before aggregation. At the server, a complementary novel
aggregation strategy mitigates global imbalance across clients,
ensuring that underperforming classes receive proportional attention during model updates. We empirically establish that quantization
directly influences the performance of undersampled (minority)
classes. Experimental results further show that Fed-CASQ substantially narrows the performance gap for low-accuracy classes,
improving their accuracy by ≈30%, while reducing training
latency by over 56% on resource-constrained pervasive devices.
|
|
|
CAViAR: Quality-Aware Vision-and-Radio Fusion for
Relative Range Estimation among Collaborative Autonomous Agents
Gaurav Shinde,
Anuradha Ravi, Jared Lewis, Andre Harrison, Henry Gardiner, MS Anwar, Shadman Sakib, Jade Freeman, Nirmalya Roy
IEEE WoWMoM, 2026
paper
In mission-critical scenarios, autonomous agents often
operate in GPS-denied or GPS-degraded environments, making it challenging
to localize agents relative to themselves and objects of interest
(e.g., potential cover) or adversarial robots. While vision- and radio-based
modalities have individually been used to estimate relative ranges
between agents and surrounding objects, each modality suffers from inherent
limitations. Vision-based range estimation degrades significantly
under low image overlap, partial visibility, or when an agent is too close
to an object, resulting in effective zoom-in and loss of geometric context.
Conversely, radio-based ranging is susceptible to multipath interference,
signal fading, and environmental variability, which can substantially
reduce range accuracy and reliability. To address these challenges, we
introduce CAViAR, a quality-aware multimodal fusion framework for accurate
relative range estimation among collaborative autonomous agents.
CAViAR assigns modality-specific reliability scores and performs statistical
fusion to adaptively weigh vision and radio inputs based on their
estimated quality. The framework employs modality-specific quality
estimators augmented with temporal features and integrates MBConv
blocks to enable efficient feature processing on resource-constrained
robotic platforms. We validate CAViAR on ROSbot 2 and ROSbot 2
Pro platforms using an in-house dataset collected across diverse indoor
and outdoor environments. Experimental results demonstrate that
our approach outperforms single-modality baselines by approximately
21% over vision-only and 36% over radio-only range estimates.
Moreover, CAViAR adapts robustly to variations in scene structure,
viewpoint overlap, and occlusions without requiring fine-tuning on new
environments, highlighting its practicality for real-world deployment.
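The abstract does not spell out the fusion rule, so the following is only a minimal sketch of quality-weighted fusion in general, not CAViAR's actual method. It assumes each modality reports a range estimate together with a reliability score in [0, 1]; the names `RangeEstimate` and `fuse_ranges` and the example numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RangeEstimate:
    range_m: float      # estimated relative range in metres
    reliability: float  # hypothetical quality score in [0, 1]


def fuse_ranges(estimates: Sequence[RangeEstimate], eps: float = 1e-9) -> float:
    """Reliability-weighted average of per-modality range estimates."""
    total = sum(e.reliability for e in estimates)
    if total <= eps:
        # No modality is trusted, so refuse to return a fused value.
        raise ValueError("no reliable estimate available")
    return sum(e.range_m * e.reliability for e in estimates) / total


# Illustrative values only: vision is trusted more than radio here.
vision = RangeEstimate(range_m=4.0, reliability=0.9)
radio = RangeEstimate(range_m=5.0, reliability=0.3)
fused = fuse_ranges([vision, radio])
print(f"fused range: {fused:.2f} m")
```

With these example scores the fused range lies between the two inputs and closer to the vision estimate, which is the qualitative behaviour a quality-aware weighting is meant to produce.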
|
|
|
COHORT: Hybrid RL for Collaborative Large DNN
Inference on Multi-Robot Systems Under Real-Time
Constraints
MS Anwar, Anuradha Ravi, Indrajeet Ghosh,
Gaurav Shinde,
Carl Busart, Nirmalya Roy
IEEE WoWMoM, 2026
paper
Large deep neural networks (DNNs), especially
transformer-based and multimodal architectures, are computationally demanding and challenging to deploy on resource-constrained edge platforms like field robots. These challenges
intensify in mission-critical scenarios (e.g., disaster response),
where robots must collaborate under tight constraints on
bandwidth, latency, and battery life, often without infrastructure or server support. To address these limitations, we
present COHORT, a collaborative DNN inference and task-execution framework for multi-robot systems built on the
Robot Operating System (ROS). COHORT employs a hybrid
offline–online reinforcement learning (RL) strategy to dynamically schedule and distribute DNN module execution across
robots. Our key contributions are threefold: (a) Offline RL policy learning combined with Advantage-Weighted Regression
(AWR), trained on auction-based task allocation data from
heterogeneous DNN workloads across distributed robots, (b)
Online policy adaptation via Multi-Agent PPO (MAPPO), initialized from the offline policy and fine-tuned in real time, and
(c) comprehensive evaluation of COHORT on vision-language
model (VLM) inference tasks such as CLIP and SAM, analyzing scalability with increasing robot/workload and robustness
under . We benchmark COHORT against genetic algorithms
and multiple RL baselines. Experimental results demonstrate
that COHORT reduces battery consumption by 15.4%, and
increases GPU utilization by 51.67%, while satisfying frame-rate and deadline constraints 2.55x of the time.
|
|
|
A Survey on Efficient Vision-Language Models
Gaurav Shinde, Anuradha Ravi, Emon Dey,
Shadman Sakib, Milind Rampure, Nirmalya Roy
WIREs DMKD, 2025
paper
Vision-language models (VLMs) integrate visual and
textual information, enabling a wide range of
applications such as image captioning and visual
question answering, making them crucial for modern AI
systems. However, their high computational demands pose
challenges for real-time applications. This has led to a
growing focus on developing efficient vision-language
models. In this survey, we review key techniques for
optimizing VLMs on edge and resource-constrained
devices. We also explore compact VLM architectures and
frameworks, and provide detailed insights into the
performance-memory tradeoffs of efficient VLMs.
|
|
|
E2RespUNet: End-to-End Respiratory Signal
Reconstruction and Rate Prediction
Using a Unified Attention-Enhanced U-Net
Shadman Sakib,
Gaurav Shinde,
Emon Dey, Nirmalya Roy
IEEE SmartComp, 2025
paper
Continuous, non-invasive respiratory rate (RR) monitoring is essential for the early diagnosis of many medical problems. However, conventional contact-based sensors frequently underperform in dynamic situations, can be uncomfortable, and require human intervention. To overcome these limitations, we propose E2RespUNet, an end-to-end system that uses multimodal video data to estimate breathing rates and reconstruct respiratory signals using an attention-enhanced U-Net architecture. Our method combines optical flow analysis with preprocessing, detrending, and normalization to reliably extract chest motion features in a variety of settings. According to our study in both the temporal and frequency domains, E2RespUNet surpasses existing baseline models by lowering the mean absolute error by up to 21% in the sleep dataset and by up to 28% in the in-house dataset.
|
|
|
CoOpTex: Multimodal Cooperative Perception and Task
Execution in Time-critical Distributed Autonomous Systems
MS Anwar, Anuradha Ravi, Emon Dey,
Gaurav Shinde, Indrajeet Ghosh, Jade Freeman, Carl Busart, Andre Harrison, Nirmalya Roy
IEEE DCOSS-IoT, 2025
paper
Integrating multimodal data such as RGB and LiDAR from multiple views significantly increases computational and communication demands, which can be challenging for resource-constrained autonomous agents while meeting the time-critical deadlines required for various mission-critical applications. To address this challenge, we propose CoOpTex, a collaborative task execution framework designed for cooperative perception in distributed autonomous systems (DAS).
|
|
|
TAVIC-DAS: Task and Channel-Aware
Variable-Rate Image Compression for Distributed
Autonomous System
Gaurav Shinde, Anuradha Ravi, Emon Dey, Jared Lewis, Nirmalya Roy
IEEE PerCom Workshop, 2025
paper
In network-constrained environments, distributed multi-agent systems, such as UGVs and UAVs, must communicate effectively to support computationally demanding scene perception tasks like semantic and instance segmentation. These tasks are challenging because they require high accuracy even when using low-quality images, and the network limitations restrict the amount of data that can be transmitted between agents. To overcome these challenges, we propose TAVIC-DAS, which performs task- and channel-aware variable-rate image compression to enable distributed task execution and minimize communication latency by transmitting compressed images. TAVIC-DAS introduces a novel image compression and decompression framework (distributed across agents) that integrates channel parameters such as RSSI and data rate into a task-specific "semantic segmentation" DNN to generate masks representing the objects of interest in the scene (ROI maps), assigning a high pixel density to objects of interest and a low density to the surrounding pixels within an image.
|
|
|
De-SaTE: Denoising Self-attention Transformer
Encoders for Li-ion Battery Health Prognostics
Gaurav Shinde, Rohan Mohapatra, Pooja Krishan, Saptarshi Sengupta
IEEE BigData, 2023
paper
The usage of Lithium-ion (Li-ion) batteries has gained widespread popularity across various industries, from powering portable electronic devices to propelling electric vehicles and supporting energy storage systems. A central challenge in Li-ion battery reliability lies in accurately predicting their Remaining Useful Life (RUL), which is a critical measure for proactive maintenance and predictive analytics. This study presents a novel approach that harnesses the power of multiple denoising modules, each trained to address specific types of noise commonly encountered in battery data. Specifically, a denoising auto-encoder and a wavelet denoiser are used to generate encoded/decomposed representations, which are subsequently processed through dedicated self-attention transformer encoders. After extensive experimentation on NASA and CALCE data, a broad spectrum of health indicator values are estimated under a set of diverse noise patterns. The reported error metrics on these data are on par with or better than the state-of-the-art reported in recent literature.
|
Awards
>
Awarded NSF Graduate Research Fellowship [2023]
>
Received Special Research Distinction Award from UT
Austin [2023]
>
Winner of the Capital of Texas Undergraduate
Research Conference (CTURC) [2022]
>
Best Virtual Reality Hack @Hack The Northeast
[2021]
>
Best iOS App @Orion Hacks [2021]
>
First Award, Physical Science & Engineering,
Synopsys Technology Championship [2019]
>
Mu Alpha Theta Award for Excellence in Mathematics
[2019]
>
Honorable Mention, Computational Systems &
Analysis, California Science and Engineering Fair [2018]
>
First Award, Biological Science & Engineering,
Synopsys Technology Championship [2018]
>
Naval Science Award, United States Navy & Marine
Corps [2018]
>
Special Congressional Recognition - Congressional
App Challenge [2017]
>
AP National Scholar with Distinction, National
Merit Scholarship Commended [2019]
>
Recognition for Science Research, Mayor of
Cupertino [2018, 2019]
>
Inspire Award, Silicon Valley Regional Robotics
Competition [2017]
>
USA Computing Olympiad (USACO) Gold Level [2017]
Service
>
Conference reviewer for CoRL, ICML, RA-L (x5), IROS,
ICRA
>
Teaching Assistant for CS 343 Artificial Intelligence
[Spring 2023]
Course website:
https://rpl.cs.utexas.edu/cs343_spring2023/
>
Co-organizer, 12th Annual F1TENTH Racing Competition at
CPS-IoT Week 2023 [2022 - 2023]
Co-organized the 12th F1TENTH Racing Competition held
in May 2023 at CPS-IoT Week
>
AURA Texas Representative [2022 - Present]
Present research to members of Congress; promote UT
undergraduate research at other Texas universities
>
Research Ambassador at UT Austin [2021 -
Present]
Serve in a Q&A panel in bimonthly undergraduate
research events; promote undergraduate research
|
Website adapted from
Jon Barron
|