About Contezy

Contezy engineers scalable web experiences and AI-driven systems that power automation and critical decision-making. Our focus is on reproducibility, cost-effective performance, and secure, production-grade deployment of machine learning infrastructure.

Role Summary

The AI / ML Developer will architect and maintain our proprietary self-hosted LLM systems. This includes building solutions for retrieval-augmented generation (RAG), custom task-specific assistants, and knowledge indexing, as well as optimizing real-time inference. You will own the ML lifecycle end to end: model selection, high-quality dataset engineering, fine-tuning, deployment, and performance monitoring.

Key Responsibilities

  • Select, benchmark, and evaluate LLMs against performance, latency, and cost trade-offs for production use.
  • Fine-tune and adapt foundation models on curated, clean datasets using supervised fine-tuning (SFT) and parameter-efficient methods such as LoRA (a minimal LoRA sketch follows this list).
  • Design and implement retrieval-augmented generation (RAG) pipelines using vector stores, embedding models, and efficient indexing workflows (see the retrieval sketch after this list).
  • Develop scalable model-serving APIs and inference systems, including managing multi-GPU configurations, serving quantized models, and implementing batching strategies.
  • Containerize models and deploy end-to-end ML workflows using Docker and Kubernetes within CI/CD pipelines.
  • Optimize model inference performance through quantization, ONNX conversion, and accelerated runtimes (e.g., Triton, TensorRT).
  • Instrument observability and performance metrics, including latency, throughput, and operational cost monitoring.
  • Collaborate with cross-functional engineering teams to integrate LLM services into core production systems.
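
For illustration, a minimal LoRA fine-tuning setup of the kind this role involves might look like the sketch below, using Hugging Face Transformers and PEFT. The base model and hyperparameters are placeholder assumptions, not a prescribed configuration.

```python
# A minimal LoRA fine-tuning sketch using Hugging Face Transformers and PEFT.
# The base model is a small, ungated placeholder chosen for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Attach low-rank adapters to the attention projections; the frozen base
# weights are untouched, so only a small fraction of parameters train.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # OPT attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable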
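
Similarly, the retrieval step of a RAG pipeline can be sketched in a few lines, assuming sentence-transformers for embeddings and FAISS as the vector store; the embedding model and corpus here are illustrative.

```python
# A bare-bones RAG retrieval step: embed a corpus, index it, and fetch
# the top-k passages for a query. Model name and corpus are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "LoRA adapts a frozen base model with low-rank weight updates.",
    "Quantization trades a little accuracy for much cheaper inference.",
    "FAISS provides exact and approximate nearest-neighbor search.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(documents, normalize_embeddings=True)

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = embedder.encode(["How does LoRA work?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[i]}")
```

In production, the retrieved passages would be injected into the LLM prompt as grounding context, and the flat index would typically give way to an approximate-nearest-neighbor index (e.g., IVF or HNSW) at scale.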

Required Qualifications

  • 2+ years of professional experience in AI/ML engineering, with substantial hands-on experience in LLM deployment and serving.
  • Expertise in Python, PyTorch, and developing robust, production-ready ML pipelines.
  • Direct experience with self-hosting LLMs, model serving frameworks (e.g., vLLM, Text-Generation-Inference), and advanced GPU optimization.
  • Strong proficiency in containerization (Docker/Kubernetes) and in building high-performance backend APIs (FastAPI/Flask); a minimal serving sketch follows this list.
  • Knowledge of vector databases (e.g., FAISS, Milvus, Pinecone) and effective retrieval strategies for RAG systems.
  • Familiarity with quantization methods, LoRA fine-tuning, and deployment optimization strategies aimed at cost efficiency.
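
As a point of reference for the API requirement above, a minimal FastAPI serving endpoint might look like the following sketch; the generation pipeline and model are small placeholders standing in for a real LLM backend.

```python
# A minimal model-serving endpoint with FastAPI; the pipeline model is a
# small placeholder, not a production choice.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="distilgpt2")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    # One forward pass per request; production serving would add request
    # batching, token streaming, and authentication on top of this.
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}
```

Served with uvicorn (for example, "uvicorn serve:app" if the file is named serve.py), the endpoint accepts a JSON body with a prompt and returns the completion.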

Preferred Skills

  • Extensive experience with the Hugging Face ecosystem (Transformers, Accelerate) and orchestration tools like LangChain or LlamaIndex.
  • In-depth knowledge of open foundation models (Llama, Mistral, Falcon, Mixtral, etc.) and quantized inference formats and runtimes (GGUF, bitsandbytes, llama.cpp); see the quantized-inference sketch after this list.
  • Exposure to advanced MLOps practices, model observability platforms, and establishing reproducible training and deployment workflows.
  • Documented experience fine-tuning or benchmarking foundation models on proprietary datasets.
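
To make the quantized-inference point concrete, a local GGUF checkpoint can be loaded in a few lines with llama-cpp-python; the model path below is a hypothetical placeholder for any 4-bit GGUF file.

```python
# A quantized-inference sketch with llama-cpp-python loading a GGUF file.
# The model path is hypothetical; any quantized GGUF checkpoint would do.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to GPU when one is available
)

out = llm(
    "Summarize retrieval-augmented generation in one sentence.",
    max_tokens=64,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```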