AI QA
Job Summary
We are seeking an AI QA Engineer to ensure the quality, accuracy, and performance of our enterprise-grade Natural Language to SQL (NL2SQL) pipeline. You will be responsible for validating a complex, multi-stage AI architecture, including semantic routing, LLM-based disambiguation, and query generation, ensuring it securely and accurately translates user intent into valid queries within the BFSI domain.

Experience: 7+ Years
Location: Gurugram
Work Mode: Hybrid - 3 Days WFO
Employment Type: Full-Time

Key Responsibilities
- LLM & Pipeline Evaluation: Design and execute automated evaluations for a 4-stage NL2SQL pipeline using LangSmith. Monitor metrics such as structural F1, execution accuracy, latency, and token cost.
- Dataset Management: Create, curate, and maintain benchmark/golden datasets for continuous regression testing of LLM prompts and model outputs.
- Search & Retrieval Testing: Validate precision and recall trade-offs in semantic search and schema discovery, ensuring optimal candidate selection for downstream query generation.
- Failure Analysis & Debugging: Perform root cause analysis across pipeline stages (routing, disambiguation, query generation, execution), identifying issues such as schema mismatches, type/coercion errors, runtime incompatibilities, and query structure failures.
- E2E & API Automation: Develop automated test scripts using Python (Pytest) for backend API testing and Playwright for the React frontend, validating end-to-end user workflows.
- Observability & Debugging: Use Grafana and structured JSONL logs to identify pipeline bottlenecks, LLM hallucinations, and prompt degradation.
- Compliance & Security: Ensure the AI pipeline meets strict BFSI data security standards, validating execution safety mechanisms (e.g., runtime capability probing, injection prevention). Design validation rules and guardrails for AI pipelines to prevent invalid query generation and runtime failures.

Required Skills
- AI/LLM Testing: Experience testing LLM applications, RAG (Retrieval-Augmented Generation) pipelines, or NLP models. Familiarity with AI evaluation frameworks (e.g., LangSmith, DeepEval, or similar).
- Languages: Strong proficiency in Python 3.12+ (crucial for integrating with the existing AI backend and Pytest suite). Secondary experience with JavaScript/TypeScript.
- Test Automation: Expertise in API testing (REST); UI automation with Playwright is a plus.
- Data & Search: Understanding of vector databases (e.g., Milvus, Pinecone) and semantic search concepts (embeddings, hybrid search).
- Data & SQL Validation: Solid understanding of SQL and data validation techniques to verify the correctness of complex query outputs.
- Tools & Infrastructure: Git, Docker, CI/CD pipelines, and observability tools (Prometheus/Grafana).

Education
- BE / BTech / MCA / BSc in Computer Science, Data Science, or a related field.

Nice to Have
- Familiarity with graph databases (Neo4j) and LangGraph orchestration.
- Experience evaluating foundation LLMs (OpenAI, Anthropic, Google).
- Prior exposure to query languages such as SQL or PURE, or to functional programming languages.
- Experience testing workflows that span multiple services or pipelines, with an understanding of failure handling, retries, and system reliability concepts.
- Experience in Banking, Financial Services, or Insurance domains.
- Understanding of data security, compliance, and enterprise database schemas.
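To illustrate the kind of work involved, here is a minimal sketch of the "execution accuracy" check described in the responsibilities above: two SQL queries count as equivalent if they return the same result set when executed. The `accounts` table, the sample queries, and the `execution_match` helper are hypothetical illustrations (using in-memory SQLite), not part of the actual pipeline.

```python
# Minimal sketch of an NL2SQL "execution accuracy" check.
# All table names and queries below are hypothetical examples.
import sqlite3

def execution_match(conn, generated_sql, golden_sql):
    """Return True if both queries yield the same result set (order-insensitive)."""
    generated = conn.execute(generated_sql).fetchall()
    golden = conn.execute(golden_sql).fetchall()
    return sorted(generated) == sorted(golden)

# Build a tiny in-memory fixture database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, branch TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "Gurugram", 5000.0), (2, "Mumbai", 12000.0)],
)

# A generated query may differ textually from the golden query yet still be
# correct, which is why execution results (not strings) are compared.
generated = "SELECT id FROM accounts WHERE balance > 6000"
golden = "SELECT id FROM accounts WHERE balance >= 12000.0"
assert execution_match(conn, generated, golden)
```

In a real Pytest suite, checks like this would run over a curated golden dataset so that prompt or model changes are caught as regressions.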
