Applied AI Engineer - LATAM

Key Job Info

Location: LATAM (Fully Remote)
Employment Type: Full-Time
Experience Required: 3+ Years
Salary Range: $7,000 - $10,000 per month

About the Company

We are a pre-seed startup building agents (not the slow kind) for the global entertainment industry and the creator economy. Professionals at WME, UTA, Netflix, Night, and Live Nation already use our platform through its private beta. We also have a massive trust moat: AvA, a proprietary distribution network of 270K+ verified entertainment professionals.

Less than six months after launching in stealth, we have already been featured in The Hollywood Reporter, Vanity Fair, and Variety. We are led by veteran founders with experience at UTA Ventures, WME, Live Nation, Hebbia AI, and Robinhood, alongside technical talent from NVIDIA, Intuit, and HubSpot. Our advisors come from CAA, Meta, Promise, Magic Leap, and Patreon. Our last $400K round was oversubscribed (backed by angels from Coatue, Ramp, Plug and Play, FanFix, Temple Hill, and Outshine Talent), and we are now raising $2M with over $500K already committed.

About the Role

We are hiring an Applied AI Engineer dedicated to agent reliability. The mandate is deliberately narrow: we do not need another full-stack engineer or a generic senior architect. We need someone whose hands stay in the eval pipeline and who will take the multi-agent system from "works in demos" to "survives non-deterministic LLM output at production scale."

You will partner with our technical co-founder on architecture decisions where reliability is at stake, including human-in-the-loop (HITL) escalation, thread context collapse, and multi-channel routing. Our team owns the long-term evals vision, with additional support migrating in over the summer. You will build and run the pipeline day to day.

What You'll Own

Instrument the scheduling agent end-to-end with traces on top of the existing Langfuse deployment.
Build eval datasets from real production traffic across the agent layer (scheduling, notetaker, iMessage, Gmail).
Stand up the Braintrust scoring pipeline, including both quality scorers and robustness scorers (Did the system survive the LLM output? Did HITL trigger when it should?).
Own the feedback loop from evals, to prompt and architecture changes, to re-evaluation, bringing in DSPy and DPO as the system matures.
Partner with the technical co-founder on agent architecture decisions where reliability is at stake.
Operate end-to-end on your workstream: development, testing, deployment, and on-call support.

Requirements

Production Experience: Real production experience with non-deterministic LLM systems, with evidence of shipping, debugging, and maintaining agent systems under real traffic. Side projects and OpenAI API experimentation do not count.
Eval Intuition: Has built or operated an eval pipeline before. Braintrust experience is not required, but strong intuition for scorer design is.
Product-First Mindset: Can use scorer design and feedback loops to drive a real product forward, not just build models in isolation.
Backend Fluency: Strong Python backend skills. Our stack is FastAPI, Django, Celery, and LangGraph, and you will ship on it.
Observability: Comfortable with observability tooling (Langfuse or equivalent) and logs/metrics stacks like Grafana or Loki.
Ownership: Senior enough to own a workstream end-to-end, from development and testing through deployment and on-call support, without relying on pair programming.
Communication: Works async, writes clearly, and is fluent in English.

Nice-to-Have

DSPy, DPO, or other LLM optimization experience.
Prior agentic systems work with LangGraph or similar (tool use, planning, multi-step).
Time at a prestige-tier LATAM company (Nubank, Itaú, iFood) or a strong YC/US startup.
Exposure to multi-model stacks (GPT-4.1, GPT-4o, Claude, Whisper) in production.

Interview Process

Initial Call
Technical Deep Dive + Take-Home
Final culture and commitment check
Offer Extended
Hired