Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam) https://ift.tt/oqORCJH

Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam) Hi HN. I'm Ken, a 20-year-old Stanford CS student. I built Sup AI. I started working on this because no single AI model is right all the time, but their errors don’t strongly correlate. In other words, models often make unique mistakes relative to other models. So I run multiple models in parallel and synthesize the outputs by weighting segments based on confidence. Low entropy in the output token probability distributions correlates with accuracy. High entropy is often where hallucinations begin. My dad Scott (AI Research Scientist at TRI) is my research partner on this. He sends me papers at all hours, we argue about whether they actually apply and what modifications make sense, and then I build and test things. The entropy-weighting approach came out of one of those conversations. In our eval on Humanity's Last Exam, Sup scored 52.15%. The best individual model in the same evaluation run got 44.74%. The relative gap is statistically significant (p < 0.001). Methodology, eval code, data, and raw results: - https://sup.ai/research/hle-white-paper-jan-9-2026 - https://github.com/supaihq/hle Limitations: - We evaluated 1,369 of the 2,500 HLE questions (details in the above links) - Not all APIs expose token logprobs; we use several methods to estimate confidence when they don't We tried offering free access and it got abused so badly it nearly killed us. Right now the sustainable option is a $5 starter credit with card verification (no auto-charge). If you don't want to sign up, drop a prompt in the comments and I'll run it myself and post the result. Try it at https://sup.ai . My dad Scott (@scottmu) is in the thread too. Would love blunt feedback, especially where this really works for you and where it falls short. Here's a short demo video: https://www.youtube.com/watch?v=DRcns0rRhsg https://sup.ai March 26, 2026 at 04:45PM

Comments

Popular posts from this blog

Complete Guide to E-Commerce Business: Meaning, Models, and How to Start

Micro Niches: The Secret Weapon for SaaS Startups Struggling to Gain Traction

"From Micro Niche to Money Maker: How I Validated My E-Commerce Idea with AI (No Budget Needed)" Published: September 23, 2025 Keywords: Micro niche, AI validation, e-commerce, free tools, startup strategy Introduction Ever wondered if your e-commerce idea is worth pursuing? In this post, I’ll walk you through how I used free AI tools to validate a micro niche, build a lean store, and test demand—without spending a dime. If you’re stuck between ideas or afraid of wasting time and money, this guide is your shortcut to clarity. Step-by-Step Breakdown 1. Finding the Micro Niche Used ChatGPT to brainstorm underserved product categories. Cross-referenced with Google Trends and AnswerThePublic to check search interest. 2. Validating Demand Leveraged Perplexity AI to analyze competitors and market gaps. Ran polls using Typeform and Twitter/X to gauge interest. 3. Building the Store Created a free storefront using Shopify Starter and Canva for branding. Used Durable.co to generate landing page copy in minutes. 4. Driving Traffic Scheduled posts with Buffer across Instagram, Threads, and LinkedIn. Used Notion AI to draft blog content and email sequences. 5. Tracking Results Monitored engagement with Google Analytics and Hotjar. Adjusted product positioning based on feedback from Tally Forms. Key Takeaways Micro niches are goldmines when paired with smart AI validation. You don’t need a budget—just the right tools and strategy. Testing before investing saves time, money, and frustration. Thinking of launching your own store? Drop your niche idea in the comments and I’ll help you validate it with AI—free of charge!