Elevating Talents, Advancing Careers.
All roles are project-based and supervised. Browse our open positions below.
Compare AI-generated responses and select the best based on helpfulness, accuracy, and safety rubrics. Your rankings directly shape how frontier language models learn.
Score and audit AI responses across multiple quality dimensions, writing clear feedback that drives model improvement.
Test whether AI models correctly execute complex, multi-step instructions by checking every specified constraint independently.
Engage AI models in realistic multi-turn conversations and assess coherence, context retention, and response quality across the full dialogue.
Verify factual claims in AI outputs against authoritative sources, identifying and categorizing hallucinations with expert precision.
Evaluate extended AI outputs including essays, reports, and summaries for coherence, structure, accuracy, and stylistic quality across thousands of words.
Evaluate AI systems that generate or reason across text, images, and other modalities, assessing cross-modal consistency and reasoning quality.
Produce precise bounding boxes, semantic labels, and classification tags across diverse image datasets for computer vision model training.
Annotate objects, actions, and events across video sequences with consistent tracking, temporal boundaries, and activity labels.
Produce pixel-precise segmentation masks that define the exact boundaries of every object in an image: the most technically demanding vision annotation task.
Annotate 3D LiDAR point cloud data with cuboids, segmentation labels, and tracking IDs for autonomous vehicle and robotics AI.
Annotate clinical images including CT, MRI, X-ray, and pathology slides under clinical supervision for medical AI training.
Classify land use, detect infrastructure, and annotate change detection in satellite and aerial imagery for geospatial AI.
Place precise skeleton keypoints on human bodies, faces, hands, and animals to train pose estimation and motion AI.
Transcribe spoken audio into accurate text across languages, accents, and challenging acoustic conditions.
Evaluate synthesized speech for naturalness, prosody, emotional authenticity, and intelligibility using structured scoring rubrics.
Annotate who speaks when in multi-speaker recordings, precisely marking speaker turns, overlaps, and re-entries.
Listen to audio and identify dialect, accent, and regional speech patterns with precision for inclusive speech AI development.
Label environmental sounds, acoustic scenes, and sound events in audio recordings for audio AI model training.
Apply precise category labels to text across complex, multi-level taxonomies for NLP model training at scale.
Tag entities (people, organizations, locations, dates, products, and custom types) in text with precise span boundaries.
Craft natural, challenging, and diverse Q&A pairs across domains for reading comprehension and knowledge AI training datasets.
Evaluate AI-generated summaries for factual faithfulness, completeness, fluency, and appropriate conciseness.
Annotate user utterances with intent labels and extracted slot values for conversational AI and virtual assistant training.
Evaluate AI reasoning in biology, ecology, genetics, microbiology, and biochemistry, verifying accuracy and flagging scientific errors.
Evaluate AI outputs in organic, inorganic, physical, and computational chemistry for correctness and scientific rigor.
Review AI physics explanations and problem-solving across classical mechanics, electromagnetism, quantum mechanics, and relativity.
Clinician-led evaluation of AI medical outputs for clinical accuracy, appropriate safety caveating, and alignment with current guidelines.
Evaluate AI outputs in pharmacology, toxicology, and drug safety for accuracy, contraindication awareness, and regulatory compliance.
Evaluate AI content in climate science, ecology, geology, oceanography, and environmental policy for scientific accuracy.
Attorney review of AI legal outputs for jurisdictional accuracy, reasoning quality, and appropriate professional caveating.
Professional evaluation of AI financial and economic outputs for numerical accuracy, reasoning quality, and regulatory compliance.
Evaluate AI psychology outputs for clinical accuracy, ethical appropriateness, and alignment with current research and practice.
Evaluate AI educational content for pedagogical quality, age-appropriateness, accuracy, and alignment with learning science.
Verify AI accounting outputs for GAAP/IFRS compliance, numerical accuracy, and sound financial reporting reasoning.
Verify AI mathematical solutions and proofs step by step, from algebra to graduate-level mathematics.
Evaluate AI-generated code for correctness, security vulnerabilities, performance, and engineering best practices across multiple languages.
Evaluate AI data science outputs for statistical validity, methodology quality, and correct interpretation of analytical results.
Evaluate AI cybersecurity outputs for technical accuracy, current threat landscape alignment, and appropriate security guidance.
Evaluate AI engineering outputs for technical correctness, code compliance, and sound engineering reasoning across disciplines.
Deliberately probe AI models to find vulnerabilities, jailbreaks, and failure modes under strict ethical boundaries, making AI safer before it reaches the public.
Classify AI outputs against safety policies across harm categories, applying consistent judgment and escalation protocols.
Systematically test AI models for demographic bias and unequal treatment across race, gender, religion, and other protected characteristics.
Apply fine-grained toxicity labels and severity scores across hate speech, harassment, threats, and harmful instructional content.
Identify, categorize, and verify claims in AI outputs for factual accuracy and potential for harmful misinformation spread.
Apply phonetic, grammatical, and pragmatic annotations that teach AI the deep structure of human language.
Translate and culturally adapt AI training data so that multilingual models genuinely understand non-English languages.
Classify emotional tone, sentiment polarity, and pragmatic intent in text and audio, teaching AI to understand feeling, not just words.
Annotate morphological structure and syntactic relations in text: fundamental data for low-resource and morphologically rich language AI.
Design expert-level prompts that push AI reasoning to its limits and systematically improve model capabilities through structured experimentation.
Evaluate autonomous AI agents performing multi-step tasks with external tools: one of the most cutting-edge roles in AI capability research.
Build and maintain Python scripts and tools that support annotation pipelines, data processing, and quality assurance workflows for AI training.
Design, deploy, and maintain the infrastructure that powers AI training data pipelines, from annotation tooling to quality monitoring dashboards.
Audit AI training datasets for errors, duplicates, and quality issues before they enter model training: a high-leverage quality gate.
Evaluate LLM-generated synthetic training data for quality, diversity, and factual consistency before it enters training pipelines.
Evaluate AI systems that reason simultaneously across text, images, audio, and video: among the most advanced evaluation roles in the field today.
Test AI agents that act autonomously in complex environments, identifying failure modes, planning errors, and safety risks before deployment.
Evaluate the output quality, safety, and prompt adherence of generative AI models producing text, images, code, and other creative content.
Annotate and evaluate training data for robots and embodied AI systems, including physical task instructions, manipulation feedback, and spatial reasoning datasets.
Annotate 3D environments, spatial interactions, and extended reality content for training AI systems that operate in AR and VR spaces.
Build the interfaces that annotation teams and quality leads depend on: fast, accessible, and built for precision work.
Architect the server-side systems that route, validate, and power every annotation pipeline Stalwart operates.
Design the annotation tools and client dashboards that must be simultaneously beautiful, cognitively efficient, and enterprise-grade.
Own features end-to-end, from the annotator interface to the API to the database, in a product that directly shapes frontier AI.
Design, test, and refine the prompts that govern frontier AI model behavior, where language precision and engineering rigor meet.
Rigor and the ability to follow detailed instructions are mandatory. We provide structured onboarding, performance-guided supervision, and a clear path to senior roles.
Don't see your perfect role? Send in your application here:
Submit Open Application →