Jun Yu Tan

Machine Learning Engineer, 6sense

Hello! 👋

I am Jun Yu Tan (my first name is Jun Yu), currently a Machine Learning Engineer at Saleswhale (acquired by 6sense), building conversational AI to power automated lead conversion for sales and marketing teams. Since August 2023, I have also been pursuing an M.S. in Computer Science at Georgia Tech.

I graduated from the National University of Singapore with a B.Sc. in Data Science & Analytics, and I’m passionate about building impactful AI experiences to ultimately help make our world more efficient and creative. These days, I place more focus on scalable ML systems, as well as exploring how we can make LLMs more reliable and steerable in AI-enabled products. I also have a strong interest in entrepreneurship and would love to embark on this journey one day.

Previously, I also worked on data science and machine learning projects with eyos, Autodesk, and Micron. During my time at Micron, I performed research on novelty detection for predictive maintenance of manufacturing equipment using acoustic IoT systems.


[2023-10-11] Gave a talk at Google Developer Space about techniques to enhance reliability of LLMs in production.

[2023-08-21] Revamped this blog with a more polished theme.


Rethinking Generation & Reasoning Evaluation in Dialogue AI Systems

As Large Language Models (LLMs) gain mass adoption and excitement, there is no shortage of benchmarks within the LLM community: benchmarks like HellaSwag test for commonsense inference via sentence completion, while TruthfulQA seeks to measure a model’s tendency to reproduce common falsehoods. On the other hand, natural language generation (NLG) evaluation metrics for dialogue systems like ADEM, RUBER, and BERTScore try to capture the appropriateness of responses by mimicking the scoring patterns of human annotators1. But as we rely further on (and reap the benefits of) LLMs’ reasoning abilities in AI systems and products, how can we still grasp a sense of how LLMs “think”? Where steerability is concerned (users or developers may want to add custom handling logic and instructions), how can we ensure that these models continue to follow and reason from these instructions towards a desirable output? Verifying the instruction-following thought patterns of these dialogue generations seems to go beyond word overlaps, sentence embeddings, and task-specific benchmarks. Let’s think beyond LLMs, reframe evaluation at the level of an AI system (or agent), and examine from first principles what such a system should and should not do.
11 min read

Concepts for Reliability of LLMs in Production

Traditional NLP models are trainable, deterministic, and, for some of them, explainable. When we encounter an erroneous prediction that affects downstream tasks, we can trace it back to the model, rerun the inference step, and reproduce the same result. We obtain valuable information like confidences (prediction probabilities) as a measure of the model’s ability to perform the task given the inputs (instead of silently hallucinating), and can retrain it to patch its understanding of the problem space. By replacing these models with large language models (LLMs), we trade the controllability of ML systems for the flexibility, generalizability, and ease of use of LLMs.
11 min read
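The determinism and confidence signals described in that excerpt can be illustrated with a toy classifier; the labels, weights, and feature values below are purely illustrative, not from any real system:

```python
import math

# Toy deterministic classifier: fixed weights, softmax confidences.
# (Illustrative only -- stands in for any trained traditional NLP model.)
WEIGHTS = {"spam": [2.0, -1.0], "ham": [-1.0, 1.5]}

def predict_with_confidence(features):
    """Return (label, confidence) -- identical on every rerun."""
    scores = {label: sum(w * x for w, x in zip(ws, features))
              for label, ws in WEIGHTS.items()}
    total = sum(math.exp(s) for s in scores.values())
    probs = {label: math.exp(s) / total for label, s in scores.items()}
    label = max(probs, key=probs.get)
    return label, probs[label]

# Rerunning inference reproduces the exact same output, so an erroneous
# prediction can be traced back, and a low confidence flags inputs the
# model is unsure about instead of it silently guessing.
assert predict_with_confidence([1.0, 0.5]) == predict_with_confidence([1.0, 0.5])
```

A sampled LLM call offers no such guarantee by default: the same prompt can yield different outputs, and there is no single calibrated probability attached to the answer, which is the controllability trade-off the post refers to.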