Rethinking Generation & Reasoning Evaluation in Dialogue AI Systems

As Large Language Models (LLMs) gain mass adoption and generate excitement, there is no shortage of benchmarks within the LLM community: HellaSwag tests for commonsense inference via sentence completion, while TruthfulQA seeks to measure a model’s tendency to reproduce common falsehoods. On the other hand, natural language generation (NLG) evaluation metrics for dialogue systems, like ADEM, RUBER, and BERTScore, try to capture the appropriateness of responses by mimicking the scoring patterns of human annotators [1]. But as we rely further on (and reap the benefits of) LLMs’ reasoning abilities in AI systems and products, how can we still grasp a sense of how LLMs “think”? Where steerability is concerned (users or developers may want to add custom handling logic and instructions), how can we ensure that these models continue to follow and reason from these instructions towards a desirable output? Verifying the instruction-following thought patterns of these dialogue generations seems to go beyond word overlaps, sentence embeddings, and task-specific benchmarks. Let’s think beyond LLMs, reframe evaluation at the level of an AI system (or agent), and examine from first principles what such a system should and should not do.
11 min read

Concepts for Reliability of LLMs in Production

Traditional NLP models are trainable, deterministic, and, in some cases, explainable. When we encounter an erroneous prediction that affects downstream tasks, we can trace it back to the model, rerun the inference step, and reproduce the same result. We obtain valuable information like confidences (prediction probabilities) as a measure of the model’s ability to perform the task given the inputs (instead of silently hallucinating), and we can retrain the model to patch its understanding of the problem space. By replacing these models with large language models (LLMs), we trade the controllability of traditional machine learning (ML) systems for the flexibility, generalizability, and ease of use of LLMs.
11 min read

Designing Human-in-the-Loop ML Systems

As machine learning practitioners, we constantly strive to produce the highest-performing models to achieve the best business outcomes. But model development is only the tip of the iceberg; how well an ML solution performs has to be continuously evaluated on live predictions. When using trained models, we subtly invoke an assumption: that the training data distribution sufficiently approximates the unseen data distribution. Unfortunately, though, this does not always hold.
11 min read

Learning Bayesian Hierarchical Modeling from 8 Schools

A walkthrough of a classical Bayesian problem.
13 min read

Understanding Copulas

In statistics, copulas are functions that allow us to define a multivariate distribution by specifying its univariate marginals and their interdependencies separately. In modelling asset returns, for example, this gives greater flexibility in capturing joint behaviour during extreme events.
8 min read