Distyl Takes #1 Spot on BIRD Benchmark (Leading Text-to-SQL Benchmark)
This week, the Distyl research team took first place on BIRD, the leading Text-to-SQL (NL2SQL) benchmark. Our state-of-the-art Distillery platform, paired with GPT-4o, outperformed entries from industry giants (Google, ByteDance, IBM, Stanford, etc.) and became the first to cross 70% accuracy.
Not only did we achieve the highest accuracy, but the SQL we generated was also extremely efficient, placing second on efficiency.
Curious about our methodology? Stay tuned for our upcoming research paper! Here’s a sneak peek:
Our Biggest Breakthrough
Most NL2SQL efforts have focused heavily on schema linking, that is, selecting only the tables and columns relevant to a question before prompting the model. This emphasis shows up in content from Databricks, Snowflake, IBM, and Stanford when they topped previous NL2SQL leaderboards. Historically, reducing the noise going into LLMs was a leading way to improve results.
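For readers new to schema linking, here is a minimal sketch of the idea in Python. It is a toy keyword-overlap filter, not our method and not the implementation of any team named above. The schema, the question, and the helper names (`tokens`, `link_schema`, `render_schema`) are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical toy schema: table name -> column names.
SCHEMA = {
    "customers": ["customer_id", "name", "city"],
    "orders": ["order_id", "customer_id", "total", "order_date"],
    "warehouses": ["warehouse_id", "region", "capacity"],
}


def tokens(text):
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_schema(question, schema):
    """Keep only tables and columns that share a token with the question."""
    q = tokens(question)
    linked = {}
    for table, columns in schema.items():
        kept = [c for c in columns if tokens(c) & q]
        if kept:
            linked[table] = kept
        elif tokens(table) & q:
            # The table is mentioned but no column matched: keep every column.
            linked[table] = list(columns)
    return linked


def render_schema(schema):
    """Render a schema as compact text suitable for inclusion in a prompt."""
    return "\n".join(f"{t}({', '.join(cols)})" for t, cols in schema.items())


question = "What is the total per customer city?"
linked = link_schema(question, SCHEMA)
print(render_schema(linked))
```

In this toy case the `warehouses` table is dropped entirely, which is the "noise reduction" effect. The trade-off is that an overly aggressive filter can also drop a column the query actually needs, such as `order_date` for a question phrased in terms of "last month".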
However, starting with GPT-4o, models can handle the complexity of more complete schemas. We therefore shifted our focus from reducing noise to increasing the signal given to LLMs. This differs dramatically from how most research has approached NL2SQL, and it has only recently become possible thanks to significant improvements in LLMs.
For details on how we increased the signal we sent to the LLMs, check back here when we share the full paper!
Other Highlights
Our NL2SQL model also uses state-of-the-art techniques and unique methodologies for both self-evaluation and self-correction. Both were critical to our results, since each added significant robustness and accuracy to the system.
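To make the general idea of self-correction concrete, here is a minimal sketch of one common pattern: execute the candidate SQL, and if the database rejects it, feed the error back to the generator and retry. This is an illustration under stated assumptions, not a description of our pipeline. The `generate_sql` function is a hypothetical stand-in for an LLM call, and the table and data are made up.

```python
import sqlite3


def generate_sql(question, feedback=None):
    """Hypothetical stand-in for an LLM call.

    Returns a query with a misspelled column on the first try, and a
    corrected query once it receives error feedback.
    """
    if feedback is None:
        return "SELECT nme FROM customers WHERE city = 'Paris'"
    return "SELECT name FROM customers WHERE city = 'Paris'"


def answer_with_self_correction(conn, question, max_attempts=3):
    """Generate SQL, run it, and retry with the error message on failure."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows, attempt
        except sqlite3.Error as exc:
            # Pass the database error back so the next attempt can fix it.
            feedback = f"Query failed: {exc}"
    raise RuntimeError(f"No executable query after {max_attempts} attempts")


# Made-up in-memory database for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "Paris"), ("Bo", "Oslo")],
)

sql, rows, attempts = answer_with_self_correction(
    conn, "Which customers are in Paris?"
)
print(attempts, sql, rows)
```

Execution errors are only the easiest signal to act on. A query can run cleanly and still answer the wrong question, which is where a separate self-evaluation step becomes useful.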
Other Considerations Around NL2SQL Adoption
Beyond a generally robust NL2SQL solution, enterprise adoption requires additional work. Companies typically have a significant amount of internal knowledge that must be reflected in the SQL, in the interpretation of the natural-language question, or often both. An efficient way to find the relevant domain information and integrate it into SQL generation is a critical component of any solution. At Distyl, we believe in working backwards from use cases, so our NL2SQL solution is built with enterprise contexts in mind and can effectively incorporate internal knowledge on top of our leading NL2SQL model.
Why BIRD Matters
Other NL2SQL benchmarks exist, but they are generally considered “solved”, meaning that scores have crossed ~90%. WikiSQL, an early NL2SQL benchmark, crossed 90% around 2021. Spider came next and hit that threshold in 2023. BIRD is the newest NL2SQL benchmark and differs from its predecessors by introducing “slop”: the tables are messy, and the input queries are intentionally not clean. In short, BIRD is both more representative of real-world datasets than previous benchmarks and currently the leading NL2SQL benchmark. Even with our improvements, roughly 20 percentage points of accuracy remain before this benchmark is considered solved, so we’re excited to keep pushing this frontier forward!