About Contezy

Contezy engineers scalable web experiences and AI-driven systems that power automation and critical decision-making. Our focus is on reproducibility, cost-effective performance, and secure, production-grade deployment of machine learning infrastructure.

Role Summary

The AI / ML Developer will architect and maintain our proprietary self-hosted LLM systems. This includes building solutions for retrieval-augmented generation (RAG), custom task-specific assistants, and knowledge indexing, as well as optimizing real-time inference. You will own the ML lifecycle end to end: model selection, high-quality dataset engineering, fine-tuning, deployment, and performance monitoring.

Key Responsibilities

  • Select, benchmark, and evaluate LLMs against performance, latency, and cost trade-offs for production use.
  • Fine-tune and adapt foundation models on curated, clean datasets using supervised fine-tuning (SFT) and parameter-efficient methods such as LoRA (a minimal LoRA sketch follows this list).
  • Design and implement retrieval-augmented generation (RAG) pipelines using vector stores, embedding models, and efficient indexing workflows (see the retrieval sketch after this list).
  • Develop scalable model-serving APIs and inference systems, including managing multi-GPU configurations, serving quantized models, and implementing batching strategies.
  • Containerize models and deploy end-to-end ML workflows using Docker and Kubernetes within CI/CD pipelines.
  • Optimize model inference performance through quantization, ONNX conversion, and accelerated runtimes (e.g., Triton, TensorRT).
  • Instrument observability and performance metrics, including latency, throughput, and operational cost monitoring.
  • Collaborate with cross-functional engineering teams to integrate LLM services into core production systems.
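
For illustration, a minimal LoRA fine-tuning setup of the kind this role involves might look like the sketch below, using Hugging Face Transformers and PEFT. The base model and hyperparameters are placeholder assumptions, not a prescribed configuration.

```python
# A minimal LoRA fine-tuning sketch using Hugging Face Transformers and PEFT.
# The base model is a small, ungated placeholder chosen for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Attach low-rank adapters to the attention projections; the frozen base
# weights are untouched, so only a small fraction of parameters train.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # OPT attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable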
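
Similarly, the retrieval step of a RAG pipeline can be sketched in a few lines, assuming sentence-transformers for embeddings and FAISS as the vector store; the embedding model and corpus here are illustrative.

```python
# A bare-bones RAG retrieval step: embed a corpus, index it, and fetch
# the top-k passages for a query. Model name and corpus are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "LoRA adapts a frozen base model with low-rank weight updates.",
    "Quantization trades a little accuracy for much cheaper inference.",
    "FAISS provides exact and approximate nearest-neighbor search.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(documents, normalize_embeddings=True)

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = embedder.encode(["How does LoRA work?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[i]}")
```

In production, the retrieved passages would be injected into the LLM prompt as grounding context, and the flat index would typically give way to an approximate-nearest-neighbor index (e.g., IVF or HNSW) at scale.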

Required Qualifications

  • 2+ years of professional experience in AI/ML engineering, with substantial hands-on experience in LLM deployment and serving.
  • Expertise in Python, PyTorch, and developing robust, production-ready ML pipelines.
  • Direct experience with self-hosting LLMs, model serving frameworks (e.g., vLLM, Text-Generation-Inference), and advanced GPU optimization.
  • Strong proficiency in containerization (Docker/Kubernetes) and in building high-performance backend APIs (FastAPI/Flask); a minimal serving sketch follows this list.
  • Knowledge of vector databases (e.g., FAISS, Milvus, Pinecone) and effective retrieval strategies for RAG systems.
  • Familiarity with quantization methods, LoRA fine-tuning, and deployment optimization strategies aimed at cost efficiency.
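
As a point of reference for the API requirement above, a minimal FastAPI serving endpoint might look like the following sketch; the generation pipeline and model are small placeholders standing in for a real LLM backend.

```python
# A minimal model-serving endpoint with FastAPI; the pipeline model is a
# small placeholder, not a production choice.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="distilgpt2")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    # One forward pass per request; production serving would add request
    # batching, token streaming, and authentication on top of this.
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}
```

Served with uvicorn (for example, "uvicorn serve:app" if the file is named serve.py), the endpoint accepts a JSON body with a prompt and returns the completion.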

Preferred Skills

  • Extensive experience with the Hugging Face ecosystem (Transformers, Accelerate) and orchestration tools like LangChain or LlamaIndex.
  • In-depth knowledge of open foundation models (Llama, Mistral, Falcon, Mixtral, etc.) and quantized inference formats and runtimes (GGUF, bitsandbytes, llama.cpp); see the quantized-inference sketch after this list.
  • Exposure to advanced MLOps practices, model observability platforms, and establishing reproducible training and deployment workflows.
  • Documented experience fine-tuning or benchmarking foundation models on proprietary datasets.
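
To make the quantized-inference point concrete, a local GGUF checkpoint can be loaded in a few lines with llama-cpp-python; the model path below is a hypothetical placeholder for any 4-bit GGUF file.

```python
# A quantized-inference sketch with llama-cpp-python loading a GGUF file.
# The model path is hypothetical; any quantized GGUF checkpoint would do.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to GPU when one is available
)

out = llm(
    "Summarize retrieval-augmented generation in one sentence.",
    max_tokens=64,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```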