
Experience
Felicis
Venture Fellow • Jan 2026 – Present
Selected for a highly competitive fellowship focused on leveraging AI and technology for real-world impact. Partnering with Felicis to identify, support, and accelerate early-stage student founders across campus.
Shopify
MLE Intern • May 2025 – Aug 2025
Fraud Detection: Developed a more robust buyer‑fraud detection system by implementing machine learning models with Vertex AI and optimizing data pipelines with BigQuery and Dataflow. Improvements to data freshness and analysis increased predictive accuracy, while targeted feature selection significantly reduced training iteration time.
AI Agent Network: Built a distributed agent framework powered by Neo4j, where specialized AI agents collaborated through graph traversal queries. This enabled intelligent task decomposition and significantly improved the quality and efficiency of automated output. Patent filed.
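The task-decomposition idea behind the agent network can be sketched in miniature. This is a hypothetical, dependency-free illustration (the actual system used Neo4j and specialized agents; the graph, task names, and `run_agents` dispatcher here are invented): subtasks form a dependency graph, and agents are dispatched in topological order.

```python
from collections import deque

# Hypothetical task graph: each node is a subtask handled by one agent;
# edges point from a subtask to the subtasks that depend on its output.
GRAPH = {
    "plan":     ["research", "draft"],
    "research": ["draft"],
    "draft":    ["review"],
    "review":   [],
}

def topological_order(graph):
    """Return subtasks in dependency order using Kahn's algorithm."""
    indegree = {node: 0 for node in graph}
    for deps in graph.values():
        for d in deps:
            indegree[d] += 1
    queue = deque(n for n, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for d in graph[node]:
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    return order

def run_agents(graph):
    """Dispatch each subtask to its (stand-in) agent in dependency order."""
    return [f"agent:{task}" for task in topological_order(graph)]
```

In a graph database, the same traversal would be expressed as a query over task nodes rather than an in-memory dict, but the ordering logic is the same.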
Research
CMU Language Technologies Institute
Jan 2026 – Present
Researching dynamic Mixture-of-Experts (MoE) architectures under Chenyan Xiong, designed to integrate private-domain knowledge into large language models while preserving public capabilities. Leveraging memorization sinks and expert specialization, the work uses adaptive routing and mid-training expert duplication to expand model capacity autonomously as novel data arrives. This approach unifies decentralized training setups, such as FlexOlmo, with scalable MoE systems for continual learning, balancing strong knowledge isolation with positive transfer.
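A toy sketch of the two mechanisms named above, top-k routing and mid-training expert duplication. Everything here is illustrative (the class, its gating scheme, and the duplication trigger are invented, not the research implementation): a softmax router picks the top-k experts per input, and `duplicate_expert` clones an expert's parameters so the copy can specialize on a novel domain.

```python
import math
import random

class AdaptiveMoERouter:
    """Toy top-k MoE router with mid-training capacity expansion."""

    def __init__(self, num_experts, dim, top_k=2, seed=0):
        rng = random.Random(seed)
        # One gating vector per expert (stand-in for a learned router).
        self.gates = [[rng.gauss(0, 1) for _ in range(dim)]
                      for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = [sum(g * v for g, v in zip(gate, x)) for gate in self.gates]
        z = max(logits)
        probs = [math.exp(l - z) for l in logits]
        total = sum(probs)
        probs = [p / total for p in probs]
        ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:self.top_k]
        norm = sum(probs[i] for i in ranked)
        return [(i, probs[i] / norm) for i in ranked]

    def duplicate_expert(self, idx):
        """Clone an expert's gate; the copy can then diverge and
        specialize on novel-domain data. Returns the new expert's index."""
        self.gates.append(list(self.gates[idx]))
        return len(self.gates) - 1
```

In a real MoE layer the gates would be learned jointly with the experts, and duplication would copy the expert's full weights, not just its gating vector.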
CMU Cosmology Laboratory & CERN
CUDA Researcher • Aug 2024 – Aug 2025
Co-authored FastGraph, a GPU-optimized k-nearest neighbor algorithm that accelerates graph construction in low-dimensional spaces (2–10D) using a bin-partitioned, fully GPU-resident architecture with full gradient-flow support. FastGraph achieves a 20–40× speedup over FAISS, ANNOY, and SCANN with virtually no memory overhead, accelerating GNN workloads including particle clustering, visual tracking, and large-scale graph clustering.
Engineered PyTorch autograd and gradient operations in C++/CUDA and integrated JIT serialization, reducing KNN runtime by an additional 10% and enabling end-to-end differentiability inside GPU training pipelines.
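The bin-partitioning idea can be illustrated on the CPU in 2D. This is a minimal pure-Python sketch, not the CUDA implementation (the grid-cell size, ring-expansion search, and function names are invented for illustration): points are hashed into uniform grid cells, and a query scans outward ring by ring until no unseen cell can contain a closer neighbor.

```python
import math
from collections import defaultdict

def build_bins(points, cell):
    """Hash each 2-D point into a uniform grid cell of side `cell`."""
    bins = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        bins[(int(x // cell), int(y // cell))].append(idx)
    return bins

def knn(points, bins, cell, query, k):
    """Exact k-NN: expand rings of cells until any unseen point
    (at distance >= ring * cell) cannot beat the current k-th best."""
    qx, qy = query
    cx, cy = int(qx // cell), int(qy // cell)
    cand, ring = [], 0
    while True:
        # Visit cells at Chebyshev distance `ring` from the query's cell.
        for bx in range(cx - ring, cx + ring + 1):
            for by in range(cy - ring, cy + ring + 1):
                if max(abs(bx - cx), abs(by - cy)) != ring:
                    continue
                for idx in bins.get((bx, by), ()):
                    x, y = points[idx]
                    cand.append((math.hypot(x - qx, y - qy), idx))
        cand.sort()
        if len(cand) >= k and ring * cell >= cand[k - 1][0]:
            return [idx for _, idx in cand[:k]]
        ring += 1
```

On a GPU, each query's cell lookup and ring scan becomes a parallel kernel over contiguous per-bin arrays, which is what makes the approach effective in low dimensions where bins stay dense.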