How Do We Make Genie Reliable Enough?

Databricks’ Genie has let organizations bring far more of their people into data discussions than ever
before through an easy-to-use, chat-with-your-data experience. Tasks that used to require an analytics
team can now be completed by any member of the organization, which frees that analytics team to focus
on work that differentiates the business.

We recently deployed a Genie pilot at a large company. Feedback was initially extremely positive, but
we ran into some challenging hurdles once we dug deeper. This post covers some of the steps we took to
ensure high-quality, reliable responses in an inherently non-deterministic medium.

Failures After the Pilot

Large language models (LLMs) already do an excellent job of semantic “understanding,” which makes it
easy to dazzle in demos. However, after our initial successes, we found a few key problems:

As data and teams were added, everyone’s experience worsened. The same spaces were relying on
inferred definitions for concepts, some similar and some different, that came from different functional
areas. In one example: How many open and accepted orders are there? In theory, this should filter on
something like unfinished and accepted columns; however, the business meaning required
non-intuitive definitions.

Our customer had high-quality existing dashboards. Stakeholders would cross-reference Genie with
their dashboards, and we found that our initial Genie pilot gave wrong answers ~30% of the
time! If we can’t count on the same question being answered the same way, can we trust
anything?
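
To make the "open and accepted orders" problem concrete, here is a minimal sketch of how a filter inferred from column names can diverge from a business definition. Everything in it is hypothetical: the column names, the status values, and the business rule are invented for illustration and are not the customer's actual definitions.

```python
# Hypothetical order records; column names and values are illustrative only.
orders = [
    {"id": 1, "status": "OPEN", "accepted": True, "on_hold": False},
    {"id": 2, "status": "OPEN", "accepted": True, "on_hold": True},
    {"id": 3, "status": "PARTIALLY_SHIPPED", "accepted": True, "on_hold": False},
    {"id": 4, "status": "CLOSED", "accepted": True, "on_hold": False},
    {"id": 5, "status": "OPEN", "accepted": False, "on_hold": False},
    {"id": 6, "status": "PARTIALLY_SHIPPED", "accepted": True, "on_hold": False},
]


def naive_open_and_accepted(rows):
    """The filter a model might infer from column names alone."""
    return [r for r in rows if r["status"] == "OPEN" and r["accepted"]]


def business_open_and_accepted(rows):
    """A made-up business rule: partially shipped orders still count as
    open, and orders on hold are excluded."""
    open_statuses = {"OPEN", "PARTIALLY_SHIPPED"}
    return [
        r
        for r in rows
        if r["status"] in open_statuses and r["accepted"] and not r["on_hold"]
    ]


naive_ids = [r["id"] for r in naive_open_and_accepted(orders)]
business_ids = [r["id"] for r in business_open_and_accepted(orders)]

print("naive:   ", naive_ids)
print("business:", business_ids)
```

Both filters look reasonable in isolation, yet they select different orders and return different counts, which is the kind of mismatch a stakeholder would notice when comparing against a dashboard.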

These are just a few examples, but once we got past the easy demo-style questions, we were facing
substantial apprehension because of incorrect assumptions and poor information retrieval.

So, What Goes into Answering a Question?

Taking a step back from Databricks and from Genie, what goes into answering a question? We have
to understand what the asker is asking, we have to know where any supporting information might come
from, and finally, we have to sort semantic wheat from chaff for precision and completeness. Genie is
much the same, and we ran into problems wherever we neglected to set it up for success.

Fixing Your Genie Problems

After that philosophical detour on question answering, we can approach Genie reliability with some
technical grounding and good design. There are many steps for improvement, but here is an overarching
checklist from our experience:

Start with good data. Evergreen advice. The fewer assumptions, the better.
1. We worked with subject-matter experts (SMEs) to redefine cleaner gold tables. This included
  explicit calculation of some of those problematic KPIs from above to minimize wasted
  time/tokens and increase reliability. This also plays neatly into the table limit on Genie Spaces –
  make sure you’re only sharing useful information.
2. Next, we leveraged pre-joined data for complex concepts, following customer patterns, which
  immediately decreased off-target behavior. We enhanced this further by explicitly defining the
  behavior of popular joins via shared keys and mapping relationships.
Be good stewards. If you have strong definitions for your cataloged data, it will be inherently more
understandable.
1. For our customer, we started with a first pass of LLM-derived descriptions from names and data
  samples. This was already a huge step forward, but we followed up with SME-written
  descriptions for all governed assets referenced by the Genie Spaces.
2. Leverage metric views. Metric views are a newer Databricks offering for machine-readable
  definitions of agreed-upon concepts. For our customer, explicitly defining metrics removed
  all the uncertainty around calculation and meaning and dropped our failure rate to single
  digits.
  1. From the example above, Genie was now able to figure out which columns were needed for
    Open and Accepted orders because the metric was explicitly defined!
Capture and respond to feedback. Once you are up and running, you can use the feedback tools in the
Genie experience to reinforce good responses and investigate bad ones.
1. Responses rated “Good” can be turned into example prompts to further ground future answers
  in the right spaces, and responses rated “Bad” can be reevaluated asynchronously to discover
  underlying failure modes.
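
The idea behind explicit metric definitions can be sketched in plain Python. This is an analogy only, not the Databricks metric view syntax: `MetricDefinition`, `REGISTRY`, `find_metric`, and the filter rule are all hypothetical names and logic invented for illustration. The point is that the meaning of a term lives in one agreed-upon, machine-readable place instead of being inferred per question.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str                    # SME-written meaning of the metric
    synonyms: Tuple[str, ...]           # phrases users actually type
    row_filter: Callable[[dict], bool]  # the agreed-upon calculation rule


def _is_open_and_accepted(row: dict) -> bool:
    # Hypothetical rule; a real definition would come from the SMEs.
    return row["status"] in {"OPEN", "PARTIALLY_SHIPPED"} and row["accepted"]


REGISTRY = [
    MetricDefinition(
        name="open_accepted_orders",
        description="Accepted orders that are not yet fully shipped.",
        synonyms=("open and accepted orders", "open accepted orders"),
        row_filter=_is_open_and_accepted,
    ),
]


def find_metric(
    question: str, registry: Sequence[MetricDefinition]
) -> Optional[MetricDefinition]:
    """Return the first metric whose name or synonym appears in the question."""
    text = question.lower()
    for metric in registry:
        if any(term in text for term in (metric.name,) + metric.synonyms):
            return metric
    return None


def evaluate(metric: MetricDefinition, rows: Sequence[dict]) -> int:
    """Count the rows that satisfy the metric's definition."""
    return sum(1 for row in rows if metric.row_filter(row))


rows = [
    {"status": "OPEN", "accepted": True},
    {"status": "PARTIALLY_SHIPPED", "accepted": True},
    {"status": "CLOSED", "accepted": True},
    {"status": "OPEN", "accepted": False},
]

metric = find_metric("How many open and accepted orders are there?", REGISTRY)
count = evaluate(metric, rows) if metric else None
print(metric.name if metric else None, count)
```

Because the dashboard and the chat experience would both resolve the term to the same definition, they cannot silently disagree on what the metric means.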

Looking Forward: Keeping the Good Responses Going

As with all AI applications, we strongly recommend implementing evaluations to track progress and
catch regressions. Genie offers a benchmarking feature that uses example questions and answers.
For this customer, we took the dashboard vs. Genie discrepancies from above and codified them into
benchmarks. We re-ran our benchmarks daily, and whenever data was added or prompts changed, to
make sure that we hadn’t lost the ability to answer the most important business questions. This is
pivotal because the success of Genie relies on user trust and buy-in.
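
The regression check described above can be sketched as a small, generic harness. We used Genie's built-in benchmarking feature; the code below is only an illustration of the underlying loop and calls no Genie API. `BenchmarkCase`, `run_benchmarks`, `fake_ask`, the questions, the numbers, and the `MIN_PASS_RATE` threshold are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class BenchmarkCase:
    question: str
    expected: float  # value taken from the trusted dashboard


@dataclass(frozen=True)
class BenchmarkResult:
    case: BenchmarkCase
    actual: float
    passed: bool


def run_benchmarks(
    cases: Sequence[BenchmarkCase],
    ask: Callable[[str], float],
    rel_tol: float = 1e-6,
) -> List[BenchmarkResult]:
    """Ask every benchmark question and compare against the expected value."""
    results = []
    for case in cases:
        actual = ask(case.question)
        passed = math.isclose(actual, case.expected, rel_tol=rel_tol)
        results.append(BenchmarkResult(case, actual, passed))
    return results


def pass_rate(results: Sequence[BenchmarkResult]) -> float:
    """Fraction of benchmark cases that passed (0.0 when there are none)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def fake_ask(question: str) -> float:
    """Stand-in for the system under test; returns canned, made-up answers."""
    canned = {
        "How many open and accepted orders are there?": 1240.0,
        "What was total revenue last month?": 98000.0,  # deliberately wrong
        "How many active customers do we have?": 312.0,
    }
    return canned[question]


cases = [
    BenchmarkCase("How many open and accepted orders are there?", 1240.0),
    BenchmarkCase("What was total revenue last month?", 101500.0),
    BenchmarkCase("How many active customers do we have?", 312.0),
]

MIN_PASS_RATE = 0.9  # hypothetical gate for flagging a regression

results = run_benchmarks(cases, fake_ask)
rate = pass_rate(results)
failures = [r.case.question for r in results if not r.passed]
regressed = rate < MIN_PASS_RATE

print(f"pass rate: {rate:.0%}, regressed: {regressed}")
for question in failures:
    print("FAILED:", question)
```

Running a loop like this on a schedule, and again after every data or prompt change, turns "did we break something important?" into a number that can be tracked over time.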

In conclusion, Genie is democratizing data access and understanding for many organizations, but it is
essential to take intentional steps to prevent off-target behavior and low-quality answers. Especially
early on, these can be showstoppers that stall further progress.

Data Surge