From RAGs to Riches: Improving the Accuracy of RAG Chatbots

This case study explores how Lime increased the number of vehicle repairs per hour by using Log10 to improve the accuracy of their RAG-based chatbot for mechanics.

Situation

Lime is the world’s largest shared electric vehicle company, with scooters and bikes available for rent in more than 280 cities in nearly 30 countries, on five continents.

Maintaining these vehicles requires a large team of service technicians who navigate thousands of pages of documentation to diagnose and repair issues. The process is inefficient: searching the documentation takes a lot of time, and sometimes technicians can’t find solutions to their issues.

As part of Lime’s investment in generative AI to increase business efficiencies, such as improving their Repair Per Hour (RPH) metric, company executives tasked Engineering with creating a RAG (Retrieval-Augmented Generation) chatbot that could assist service mechanics. The chatbot was designed to instantly provide answers from its vast database of documentation and to translate content into 30 languages on the spot.

“This task initially seemed like a breeze, thanks to the latest OpenAI models like the remarkably accurate GPT-4 and its turbo version,” said Chao Ma, Software Engineer at Lime.

The team developed a unique process called Refine/Retrieve/Rebuttal for their RAG flow that was designed to surmount practical challenges that they encountered, such as dealing with jokes, contextual integration, language detection, hallucinations, and more.

Here’s how it works: a question goes through the Refine and Retrieve stages before Inference, and the completion then goes through the Rebuttal stage before the answer is presented to the user. LLMs perform various functions along the way:

  • Refine. An OpenAI GPT-4 model mitigates potential issues with the question, adds context, and extracts entities from the question.

  • Retrieve. This is classic RAG retrieval, with the addition of an OpenAI GPT-4 model to improve the quality of similarity matching, including entity similarity.

  • Inference. The question and retrieved knowledge are sent to OpenAI GPT-4 for completion.

  • Rebuttal. An OpenAI GPT-4 model decides whether the completion is adequate; if it is not, the answer is omitted.

Diagram of Lime’s RAG flow, showing stages and use of OpenAI GPT-4 LLMs.
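The four stages can be sketched in a few lines of Python. This is only an illustration of the flow described above, not Lime’s implementation: the class name, the prompts, and the YES/NO convention for the Rebuttal verdict are all assumptions, and each model is represented by a plain function so that any LLM client could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-ins: any function mapping a prompt to a completion,
# and any function mapping a question to a list of document texts.
LLM = Callable[[str], str]
Retriever = Callable[[str], List[str]]


@dataclass
class RefineRetrieveRebuttal:
    refine_llm: LLM
    retriever: Retriever
    inference_llm: LLM
    rebuttal_llm: LLM

    def answer(self, question: str) -> Optional[str]:
        # Refine: clean up the question and add context (prompt is illustrative).
        refined = self.refine_llm(f"Rewrite this question clearly: {question}")

        # Retrieve: fetch candidate documents for the refined question.
        docs = self.retriever(refined)

        # Inference: answer using only the retrieved knowledge.
        context = "\n".join(docs)
        completion = self.inference_llm(
            f"Context:\n{context}\n\nQuestion: {refined}"
        )

        # Rebuttal: a separate model judges the completion; omit it if inadequate.
        verdict = self.rebuttal_llm(
            f"Question: {refined}\nAnswer: {completion}\n"
            "Is this answer adequate? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            return completion
        return None
```

Returning None when the Rebuttal stage rejects a completion mirrors the behavior described above, where an inadequate answer is omitted rather than shown to the mechanic.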

Problem

Unfortunately, the process didn’t work nearly as well as expected. Chao got harsh feedback from the field: “Your bot is generating incorrect answers.” When mechanics typed in an error code, the chatbot would sometimes return information for a completely different code. The instructions for repairing the vehicles were often inaccurate.

Chao wondered, “Why is my chatbot hallucinating?!” It was impossible to diagnose what was happening and understand root causes using just general-purpose tooling like the terminal interface.

The Log10 Solution

Chao realized that he needed an observability stack to understand what was going on with his chatbot. Log10 offered an end-to-end LLM Ops platform that enabled teams to trace completions from foundation models and diagnose issues, and it was free for teams to get started. With one line of integration, he was able to start logging OpenAI calls to the Log10 platform and begin debugging.

A quick search through the logs revealed why the chatbot was providing incorrect codes. “Mechanics were making typos that produced invalid codes, and Retrieval was finding a similar error code and sending this to the OpenAI model,” Chao explained. Lime engineers added extra logic to the Refine stage: now the GPT-4 model makes sure that the error code from the question is valid, and if not, the chatbot asks the mechanic to check the code. “This cured the error code hallucinations,” Chao said.

Using logs, Lime engineers quickly discovered that mechanics were making typos in error codes, which led to “hallucinations” during Inference when the OpenAI GPT-4 model took the liberty of using the closest matching code to answer the question.
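The idea behind the fix, rejecting an unknown error code before retrieval can match it to a similar one, can be illustrated with a small check. Lime’s version asks the GPT-4 model in the Refine stage to validate the code; the sketch below swaps in a deterministic lookup instead, and the code format, the catalog contents, and the function name are all hypothetical.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical catalog; real codes would come from the repair documentation.
VALID_CODES: Set[str] = {"E101", "E205", "E317"}

# Assumed format: the letter E followed by exactly three digits.
CODE_PATTERN = re.compile(r"\bE\d{3}\b", re.IGNORECASE)


def check_error_code(
    question: str, valid_codes: Set[str] = VALID_CODES
) -> Tuple[bool, Optional[str]]:
    """Return (ok, message). When ok is False, show the message to the
    mechanic instead of running retrieval."""
    match = CODE_PATTERN.search(question)
    if match is None:
        return True, None  # No error code in the question; nothing to validate.

    code = match.group(0).upper()
    if code in valid_codes:
        return True, None

    # Unknown code: ask the mechanic to re-check rather than guess a near match.
    return False, f"Error code {code} is not recognized. Please check the code and try again."
```

Failing closed on an unknown code is the key design choice: a similarity search will always return its nearest neighbor, so the validity check has to happen before retrieval.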

Understanding why the instructions for repairing vehicles were so poor was a bit more challenging. The team started by asking their stakeholders to generate a set of questions along with the correct answers from particular documents. Then they asked the chatbot the questions and used Log10 to trace which documents were being sent to OpenAI. They quickly discovered the problem: the documents being sent were irrelevant. But why?
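A check of this kind can be expressed as a simple retrieval hit rate: for each stakeholder question, was the document containing the correct answer among those sent to the model? The sketch below assumes the traced documents have been exported as lists of document IDs keyed by question; the function name and data layout are illustrative and are not part of Log10.

```python
from typing import Dict, List, Tuple


def retrieval_hit_rate(
    expected: Dict[str, str], retrieved: Dict[str, List[str]]
) -> Tuple[float, List[str]]:
    """Compare the expected source document for each question with the
    documents actually retrieved for it.

    expected:  question -> ID of the document that holds the correct answer
    retrieved: question -> IDs of the documents sent to the model
    Returns the fraction of questions whose expected document was retrieved,
    plus the list of questions that missed.
    """
    misses = [
        question
        for question, doc_id in expected.items()
        if doc_id not in retrieved.get(question, [])
    ]
    total = len(expected)
    rate = (total - len(misses)) / total if total else 0.0
    return rate, misses
```

The list of missed questions is usually more useful than the rate itself, since it points directly at the traces worth inspecting.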

Log10 provides the ability to look at every completion in detail. Further inspection revealed that the RAG was ignoring critical details about the vehicle, such as its generation. For example, sometimes Gen 3 documents were being used to answer questions about Gen 4 vehicles.

The team focused on refining the logic in the Refine stage, asking the GPT-4 model to identify entities such as the vehicle type and generation, while using Log10 to debug code generation issues and validate that the correct entities were passed. “Adding logic to define the entities totally fixed our accuracy issue,” explained Chao.

The Lime team was able to trace completions and verify that the correct document was sent to Retrieval.
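One way to picture the effect of entity extraction is as a filter on retrieved documents: once the vehicle type and generation are known, documents for other vehicles can be dropped before Inference. This is a minimal sketch under assumptions. The source does not say how Lime applies the entities, and the Doc and Entities types, their fields, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Doc:
    text: str
    vehicle_type: str  # e.g. "scooter" or "bike" (assumed metadata)
    generation: int    # e.g. 3 or 4 (assumed metadata)


@dataclass(frozen=True)
class Entities:
    # Entities extracted from the question; None means "not mentioned".
    vehicle_type: Optional[str] = None
    generation: Optional[int] = None


def filter_by_entities(docs: List[Doc], entities: Entities) -> List[Doc]:
    """Keep only documents that match every entity found in the question."""
    kept = []
    for doc in docs:
        if entities.vehicle_type is not None and doc.vehicle_type != entities.vehicle_type:
            continue
        if entities.generation is not None and doc.generation != entities.generation:
            continue
        kept.append(doc)
    return kept
```

With a filter like this, a question about a Gen 4 scooter can no longer be answered from a Gen 3 document, even if the Gen 3 text is the closest semantic match.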

In addition to providing observability, evaluation, and debugging capabilities to fix chatbot hallucination issues, Log10 also provided the Lime team with other significant features:

  • Model Comparison: Lime engineers compared results across models to determine the best model for language translation.

  • Model Usage and Cost Tracking: Lime engineers could understand cost metrics, including per-model usage reporting.

Extending beyond the chatbot, Lime is now deploying LLMs across the enterprise to drive additional business efficiencies. One example is an operational incident dashboard that uses LLMs to continuously scan social media for vehicle incidents around the globe. Another is examining why revenue is changing unexpectedly. For instance, is an anomalous drop in Sunday revenue at a particular location related to bad weather, a major event keeping people home, or an operational issue such as deploying vehicles to the wrong location?

As these types of LLM-based applications take on “mission-critical” roles in the business, the stakes rise for accuracy and reliability. Lime is investing in AI capabilities such as Log10’s AutoFeedback model, which labels every chatbot LLM completion with generated human feedback to enable advanced LLM Ops features such as:

  • Quality Monitoring and Alerts: Monitor quality of responses and generate alerts when quality falls below a defined threshold.

  • Error Detection and Prioritization: Detect LLM output accuracy issues and triage/rank for human review by Lime engineers.

  • Self-improving Applications: Continuously improve accuracy via prompt and model optimizations using datasets curated with human feedback.
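The monitoring-and-alerts idea in the first bullet can be sketched as a rolling average over per-completion quality scores. This is a generic illustration, not Log10’s API: the class name, the 0-to-1 score scale, and the window size are assumptions.

```python
from collections import deque


class QualityMonitor:
    """Track a rolling mean of quality scores and flag drops below a threshold."""

    def __init__(self, threshold: float, window: int = 50):
        self.threshold = threshold
        # Only the most recent `window` scores are kept.
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add one score (assumed to be in [0, 1]).

        Returns True when the rolling mean is below the threshold,
        meaning an alert should be raised."""
        self.scores.append(score)
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.threshold
```

Averaging over a window rather than alerting on every low score keeps a single bad completion from paging an engineer, while a sustained drop still triggers quickly.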

Results

Since using Log10 to diagnose and debug their RAG chatbot, “our accuracy has improved considerably,” says Chao.

The company’s mechanics now have confidence in the chatbot, and the metrics are going up. “Now the chatbot provides excellent answers to our mechanics, and our Repair-Per-Hour (RPH) metric has improved dramatically,” says Chao.

Executives are increasingly relying on generated dashboards that pinpoint operational issues and explain fluctuations in revenue. “Instead of doing forensic analysis after the fact, our operations teams can find the root cause of an issue instantly,” he says.

Perhaps most importantly, Lime now has much greater visibility into its vehicles: where they’re located, how they’re parked, if they’re deployed correctly for current demand, and how they’re operating.

“Being able to understand the details of our business at a deeper level, and in the moment, is making a huge difference in Lime’s overall performance,” says Chao. “With Log10, we’re able to create genAI applications that our stakeholders can rely on.”