Curriculum Vitae
kulinseth@gmail.com · kulinseth.github.io
Profile
Engineering Manager (4+ years) and Software Engineer (10 years) with experience in Machine Learning & HPC (distributed inference, LLMs & diffusion models, server platforms) and Systems Engineering (hypervisors, GPU drivers, compilers, DSP).
Professional Experience
- Manage a team of 13 people focused on ML inference for server platforms
- Designed and led the implementation of the core distributed inference engine on Apple Silicon for the Private Cloud Compute initiative of Apple Intelligence
- Led the team to add GPU backend support to PyTorch, JAX and TensorFlow ML frameworks
- GPU backend development for ML frameworks using Metal Performance Shaders
- PyTorch MPS backend maintainer
- Designed an ML device interface by extending VMware's virtualized GPU API
- Contributed to virtualization of the Data Streaming Accelerator device with Intel
- Developed a virtualized 3D graphics driver using Apple's Metal API for the VMware Fusion product
- Developed and maintained the OpenGL Shading Language compiler frontend and its interface with the graphics driver and compiler backend
- Contributed to the design and re-architecture of the 3D driver codebase for Vulkan
- Achieved significant power savings by developing a framebuffer rotation technique for the 3D driver; later helped move the design into hardware
Data cache prefetching using stream buffers for next-gen Blackfin architecture. Work presented at Embedded Systems Conference.
- Trace-driven simulations and detailed memory modeling
- Compiler guided optimization via profiling and instrumentation
- Architectural core simulation using M5 simulator (developed Blackfin port)
- Developed a generic Verification Test Generator for post-silicon validation on SHARC processors
- Ported a real-time Ogg Vorbis decoder to SHARC processors
Technical Skills
Languages & APIs
- C/C++, CUDA, Python
- Assembly, Metal, OpenGL
Frameworks
- PyTorch (MPS backend maintainer)
- JAX, MLIR, Triton, LLVM
HPC
- NVSHMEM, NCCL, RDMA
- MPI, Slurm, Horovod
Publications & Patents
Patent: Kulin Seth, Apple Inc., "Distributed inference engine" US20250384312A1
Paper: Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang, "Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization"
Technical Note: K. Seth and J. Jegadeesan, "Implementing an Ogg Vorbis Decoder on SHARC Processors", Analog Devices Inc., Engineer-to-Engineer Note 320, June 2008.
Research Experience
- Developed a heterogeneous platform simulator (ARM + embedded GPU) for design-space exploration
- Studied correlation between LLVM compiler optimizations and micro-architectural details
- Benchmarks: OpenCL SDKs (NVIDIA & AMD), HPEC
- Improved performance using architectural features like DMA chaining and compiler optimizations
- Implemented efficient code and memory layout techniques for real-time decoding
Key Projects & Courses
Reviewed papers at the intersection of Computer Architecture and Machine Learning. Implemented a simple LeNet inference network using the Spatial DSL.
Transfer learning effectiveness in the medical domain. Reviewed neural network model compression techniques for transfer learning to embedded GPUs.
Improved performance of an ML pipeline by 20x for prostate cancer classification on an HPC cluster using PyTorch and Horovod. Implemented CNN pipelines for classification and segmentation of cancer pathology slides.
Decomposing neural networks using perturbation analysis of 3D synthetic datasets (SceneNet, ModelNet). Developed a mathematical framework using influence functions to study the effect of perturbations.
Machine learning techniques for breast cancer detection from pathology slides.
Education
Research Advisor: Prof. David Kaeli · GPA: 3.75/4.0
Academic Advisor: Prof. Sumam David · GPA: 8.53/10.0
Awards
Northeastern University 2009–2011 — Stipended Masters Graduate Research Assistantship
Qualcomm Inc. — Various QualStars for different contributions
Analog Devices Inc. 2008 — ADI Employee Performance Award
Other Projects & Experience
Range compression implementation for QLP/NRTP SAR processors on ADSP TigerSHARC processors. Processing pipeline: FFT of pulse return, FFT of reference function, complex multiplication, inverse FFT, distributed across a multiprocessing environment with timing synchronization.
Cloud tracking using image processing — scale-space classification on satellite data with noise removal via cumulative-histogram threshold classification. Applied to Mumbai's heavy downpour data (July 2005).
Compiler Project (Fall 2009) — Developed all compilation stages; extended with additional language features.
Lossless Image Compression (Spring 2007) — Wavelet transform in integer domain on BF-535 fixed point processor.
Singing Voice Classification (Spring 2006) — MFCCs with DWT as feature vectors, SVMs with RBF kernel in MATLAB.
LZW Compression (Spring 2005) — Compression and decompression implemented in ARM assembly.