# Introduction

I am a Staff Software Engineer and Researcher at Meta, specializing in large-scale distributed systems and applied AI. I am passionate about building reliable, scalable, and intelligent infrastructure that powers the next generation of agentic workflows. With deep expertise spanning large-scale distributed systems, agentic infrastructure, systems architecture, and operational resilience, I focus on solving the hardest problems at the intersection of systems, AI, and real-world execution, where theory meets engineering tradeoffs.

Within Meta SuperIntelligence Lab (MSL), I have contributed to building agentic infrastructure - systems where AI agents operate within structured distributed environments, interacting with monitoring, scheduling, and feedback loops. My work in this space focuses on:

- Evaluation and auditing of AI-driven decision systems in high-stakes production environments
- Reliability, safety, and human oversight in autonomous and semi-autonomous systems
- Designing feedback mechanisms to align system behavior with user and operational goals
- Measuring real-world impact beyond offline metrics

# Building Elastic Compute Infrastructure at Meta

I also built the next generation of elastic compute infrastructure to increase overall fleet utilization. It is responsible for managing ~30% of Meta’s capacity (tens of millions of servers) across ~20 geo-distributed datacenters, saving billions of dollars in CapEx. This also involved partnering with VPs across Ads/WhatsApp/IG/Finance/Infra to set the multi-year roadmap and strategy for increasing fleet-wide efficiency.

# Research

At Meta, my recent research includes "Dynamic Idle Resource Leasing to Safely Oversubscribe Capacity at Scale", where I designed and deployed a production system that improves datacenter utilization by leasing idle capacity while preserving reliability and strict SLO guarantees.
This work required building rigorous evaluation frameworks spanning simulation, controlled experimentation, and real-world safety validation - balancing algorithmic optimization with operational risk. The system has delivered measurable infrastructure-efficiency gains at production scale. I have also authored papers with 90+ citations.

# What I care about

I think deeply about how distributed services communicate, self-coordinate, and act reliably under ambiguity. My work is rooted in understanding latency, correctness, failure modes, and semantic interoperability - not just performance on benchmarks, but real-world outcomes that matter in production.

I’ve led teams and initiatives that:

- Architect complex distributed platforms that serve high-availability workloads at scale
- Design agentic systems and frameworks that enable coordinated autonomous behavior across services and models
- Build operationally robust infrastructure with strong observability, fault tolerance, and graceful degradation
- Translate cutting-edge research into developer-ready systems and patterns

# Education

I graduated from the University of California, Los Angeles (UCLA) with a Master's in Computer Science in December 2019. At UCLA, my focus area was building scalable distributed systems leveraging Machine Learning.

# Ways to collaborate

- Keynotes, conference talks, and technical workshops
- Partnerships with AI platforms, developer tools, and education organizations
- Advisory and consulting on AI infrastructure and large-scale systems

For speaking, partnerships, or advisory inquiries: nishantgupta@g.ucla.edu
Nishant Gupta is a Staff Software Engineer and Researcher at Meta who works on distributed inference systems and compute infrastructure at scale. He co-founded BuzzingTech.ai with Naman Ahuja to teach engineers AI-native and distributed-systems skills.