Profile

Engineering Manager (4+ years) and Software Engineer (10 years) with experience in Machine Learning & HPC (distributed inference, LLMs & diffusion models, server platforms) and Systems Engineering (hypervisors, GPU drivers, compilers, DSP).

Professional Experience

Apple Inc.
Engineering Manager — Server ML Frameworks team
Feb 2022 — Present
  • Manage a team of 13 people focused on ML inference for server platforms
  • Designed and led the implementation of the core distributed inference engine on Apple Silicon for the Private Cloud Compute initiative of Apple Intelligence
  • Led the team that added GPU backend support to the PyTorch, JAX, and TensorFlow ML frameworks
Apple Inc.
Software Engineer — Metal Performance Shaders team
Oct 2019 — Feb 2022
  • Developed GPU backends for ML frameworks using Metal Performance Shaders
  • PyTorch MPS backend maintainer
VMware Inc.
Sr. Member of Technical Staff — Hypervisor team
Aug 2016 — Oct 2019
  • Designed an ML device interface by extending VMware's virtualized GPU API
  • Contributed, with Intel, to virtualization of the Data Streaming Accelerator device
  • Developed a virtualized 3D graphics driver using Apple's Metal API for the VMware Fusion product
Qualcomm Technologies Inc.
Sr. Software Engineer — Graphics driver team
Feb 2011 — Aug 2016
  • Developed and maintained the OpenGL Shading Language compiler frontend and its interface with the graphics driver and compiler backend
  • Contributed to the design and re-architecture of the 3D driver codebase for Vulkan
  • Achieved significant power savings by developing a framebuffer rotation technique for the 3D driver; later helped move the design to hardware
Analog Devices Inc.
Co-op / Industry Project
Jan 2009 — Aug 2009

Data cache prefetching using stream buffers for next-gen Blackfin architecture. Work presented at Embedded Systems Conference.

  • Trace-driven simulations and detailed memory modeling
  • Compiler guided optimization via profiling and instrumentation
  • Architectural core simulation using M5 simulator (developed Blackfin port)
Analog Devices Inc.
DSP Applications Engineer
Jun 2007 — Jun 2008
  • Developed a generic Verification Test Generator for post-silicon validation on SHARC processors
  • Ported a real-time Ogg Vorbis decoder to SHARC processors

Technical Skills

Languages & APIs

  • C/C++, CUDA, Python
  • Assembly, Metal, OpenGL

Frameworks

  • PyTorch (MPS backend maintainer)
  • JAX, MLIR, Triton, LLVM

HPC

  • NVSHMEM, NCCL, RDMA
  • MPI, Slurm, Horovod

Publications & Patents

Patent: Kulin Seth, Apple Inc., "Distributed inference engine" US20250384312A1

Paper: Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang, "Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization"

Technical Note: K. Seth and J. Jegadeesan, "Implementing an Ogg Vorbis Decoder on SHARC Processors", Analog Devices Inc., Engineer-to-Engineer Note 320, June 2008.

Research Experience

OpenCL: Programming Framework for Heterogeneous Embedded Platforms
Jan 2010 — Feb 2011
  • Developed heterogeneous platform simulator (ARM + embedded GPU) for design-space exploration
  • Studied correlation between LLVM compiler optimizations and micro-architectural details
  • Benchmarks: OpenCL SDKs (NVIDIA & AMD), HPEC
Real-time Ogg Vorbis Decoder on SHARC Processors
Industry Project
May 2009 — May 2010
  • Improved performance using architectural features like DMA chaining and compiler optimizations
  • Implemented efficient code and memory layout techniques for real-time decoding

Key Projects & Courses

HW Accelerators for Machine Learning (Stanford CS 217)
Course
Jan 2020 — Apr 2020

Reviewed papers at the intersection of Computer Architecture and Machine Learning. Implemented simple LeNet inference network using Spatial DSL.

Advances in Deep Learning (NEU 7398)
Course Project
Jan 2019 — May 2019

Transfer learning effectiveness in the medical domain. Reviewed neural network model compression techniques for transfer learning to embedded GPUs.

Paige.AI
ML Engineering
Aug 2018 — Nov 2018

Improved ML pipeline performance by 20x for prostate cancer classification on an HPC cluster using PyTorch and Horovod. Implemented CNN pipelines for classification and segmentation of cancer pathology slides.

Network Introspection & Perturbation Analysis
Research
Jun 2018 — Present

Decomposing neural networks using perturbation analysis of 3D synthetic datasets (SceneNet, ModelNet). Developed mathematical framework using influence functions to study effect of perturbations.

Machine Learning (MIT 6.867)
Course Project
Sep 2017 — Dec 2017

Machine learning techniques for breast cancer detection from pathology slides.

Education

Northeastern University
M.S. Computer Engineering (Thesis)
Sep 2008 — May 2011

Research Advisor: Prof. David Kaeli · GPA: 3.75/4.0

National Institute of Technology, Surathkal
B.Tech. Electronics & Communications Engineering
Sep 2003 — May 2007

Academic Advisor: Prof. Sumam David · GPA: 8.53/10.0

Awards

Northeastern University 2009–2011 — Stipended Master's Graduate Research Assistantship

Qualcomm Inc. — Various QualStars for different contributions

Analog Devices Inc. 2008 — ADI Employee Performance Award

Other Projects & Experience

Indian Space Research Organization
Intern
May 2006 — Jul 2006

Implemented range compression for QLP/NRTP SAR processors on ADSP TigerSHARC processors. Processing pipeline: FFT of pulse return, FFT of reference function, complex multiplication, and inverse FFT, distributed across a multiprocessing environment with timing synchronization.

Indian Institute of Science
Summer Research
May 2005 — Jul 2005

Cloud tracking using image processing — scale-space classification on satellite data with noise removal via cumulative-histogram threshold classification. Applied to Mumbai's heavy downpour data (July 2005).

Compiler Project (Fall 2009) — Developed all compilation stages; extended with additional language features.

Lossless Image Compression (Spring 2007) — Wavelet transform in integer domain on BF-535 fixed point processor.

Singing Voice Classification (Spring 2006) — MFCCs with DWT as feature vectors, SVMs with RBF kernel in MATLAB.

LZW Compression (Spring 2005) — Compression and decompression implemented in ARM assembly.