I developed a Salesforce LLM assistant that runs locally on your computer
TL;DR
I built a Salesforce Lightning Web Component that lets you run powerful AI language models (LLMs) directly on your computer within Salesforce. It uses picoLLM technology to process data locally, keeping your information secure and responding quickly. You can use it to generate emails, write content, analyze customer data, and more, all without relying on external services. Check out the demo video and GitHub repo to learn more!
I’ve been experimenting with local LLMs inside Salesforce and would like to tell you about the component I developed as a result. It has a familiar chat interface that uses Salesforce records for context. It runs locally on your computer, and the data it processes is never sent to any third-party service.
I built it around the time Agentforce was introduced, which influenced my component. Agentforce uses agents: systems that can make decisions and perform various actions. Assistants, in contrast, only process information reactively. Even though I believe it’s possible to build a local agent using picoLLM, it would take enormous effort, so I decided to develop an assistant instead.
Features
As you would expect from an LLM, it generates responses on any topic, since it’s pretrained on a vast set of data. Moreover, it’s able to use Salesforce records for extra context.
- Works with different models. You can use any open-source model provided on the Pico website, from Gemma to Llama and Phi. The only limitation here is the amount of RAM your computer has: the larger the model, the more RAM it consumes.
- Works with a single record. When the component is placed on a record page, it can access that record for context. For example, on an Account detail page it can generate a response based on the account’s field values.
- Works with related records. The current record may also have related records. The component can query any type of related record and take them into account when generating a response.
- Configurable. The component can be configured on the fly using a configuration popup, which lets you change generation options such as the completion token limit, temperature, and top P.
How it works
From the end user’s point of view, everything is fairly simple: you upload a model, select a system prompt, select records, write a user prompt, and watch the result being generated.
What is Pico LLM?
Running LLMs in a browser is resource-intensive because of model size, bandwidth requirements, and RAM needs. To address this, the Pico team developed their picoLLM Compression technique, which makes running LLMs locally much more efficient. They provide the picoLLM Inference Engine as a JavaScript SDK that allows front-end developers to run LLMs locally across browsers. It supports all modern browsers, including Chrome, Safari, Edge, Firefox, and Opera. To learn more about how the picoLLM Inference Engine works, you can read their article.
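To give a feel for the SDK, here is a minimal sketch of loading a model and generating a completion. It follows the documented entry points of the `@picovoice/picollm-web` package (`PicoLLMWorker.create` and `generate`), but treat the exact signatures as assumptions and verify them against the Pico docs; the access key and option values are placeholders.

```javascript
// Sketch: load a model file and generate a completion with the picoLLM
// JavaScript SDK. Based on @picovoice/picollm-web's documented API;
// verify signatures against the official docs.
import { PicoLLMWorker } from "@picovoice/picollm-web";

async function loadAndGenerate(accessKey, modelFile, prompt) {
  // Create the engine from a user-supplied model file; the SDK spawns
  // web workers for parallel processing at this point.
  const picoLLM = await PicoLLMWorker.create(accessKey, { modelFile });

  // These option names mirror the ones the component exposes in its
  // configuration popup.
  const res = await picoLLM.generate(prompt, {
    completionTokenLimit: 512,
    temperature: 0.7,
    topP: 0.9,
  });
  return res.completion;
}
```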
The LWC part
The component serves as a bridge between the user and the picoLLM interface. At the heart of the component lies a Visualforce page embedded as an iframe. The page loads the picoLLM SDK and communicates with the LWC, allowing the latter to use the SDK via post messages (a sketch of this bridge follows the list). Together, these elements handle the following:
- Loading a model. The LWC has a button that lets you load a model of your choice. It triggers a file input element hidden inside the iframe. Once the model is loaded, the picoLLM SDK creates web workers, and the component is ready to process user input.
- Setting a system prompt. You don’t have to write a system prompt every time: once saved as records of the `System_Prompt__c` object, prompts are easy to reuse. Pressing the button opens a popup with the existing system prompts to choose from.
- Accepting user input. A resizable text area collects user input. Once collected, it’s sent to the iframe as a payload and added to the conversation history.
- Accessing Salesforce records. There are two buttons: Select Fields and Select Related Records. The first collects field values of the record on whose page the LWC resides. The second lets you choose a related object and query its records along with selected field values. This information is sent to the iframe as a payload as well.
- Changing generation options. If desired, the completion token limit, temperature, and top P options can be changed via a dedicated button in the component. This information is also sent as a payload to the iframe.
- Generating a result. When the iframe receives the payload, it uses the picoLLM SDK and the loaded model to generate a result, taking any provided generation options into account. The conversation history is also updated, so the LLM remembers the dialog.
- Rendering chat messages. The LWC renders outgoing messages, the ones the user provides. Incoming messages, which contain whatever the component has to say to the user (a generated result, informational text, or error messages), are rendered dynamically.
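To make the bridge concrete, here is a stripped-down sketch of the LWC side. The message shapes and property names are my illustrative assumptions, not the repo’s actual code; only `postMessage`, the `message` event, and the LWC lifecycle hook are standard APIs.

```javascript
// myAssistant.js (sketch): the LWC side of the bridge. Message shapes
// and property names are illustrative, not the repo's actual code.
import { LightningElement } from "lwc";

export default class MyAssistant extends LightningElement {
  messages = [];

  connectedCallback() {
    // Incoming messages carry generated results or status/error text.
    window.addEventListener("message", (event) => {
      if (event.data && event.data.type === "result") {
        this.messages = [
          ...this.messages,
          { role: "assistant", text: event.data.completion },
        ];
      }
    });
  }

  handleGenerate() {
    // Everything the iframe needs for one generation, in one payload.
    const payload = {
      type: "generate",
      userPrompt: this.userPrompt,       // resizable text area value
      systemPrompt: this.systemPrompt,   // chosen System_Prompt__c record
      records: this.selectedRecords,     // fields + related records
      options: this.generationOptions,   // token limit, temperature, top P
    };
    this.template
      .querySelector("iframe")
      .contentWindow.postMessage(payload, "*");
  }
}
```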
A little bit of Apex code
On the back end there is nothing fancy. The Apex code does the heavy lifting of detecting relationships between objects using a record Id from the record page, performs a couple of SOQL queries, and its duty is done.
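For illustration, relationship detection can be done with the standard Schema describe calls. This is a hedged sketch of that idea, not the repo’s actual code; the class and method names are mine, while the describe APIs are standard Apex.

```apex
// Sketch: find child relationships for the object a record Id belongs
// to, so related records can be offered as context. Class and method
// names are illustrative; the Schema describe calls are standard Apex.
public with sharing class RelatedRecordsService {
    @AuraEnabled(cacheable=true)
    public static List<String> getChildRelationshipNames(Id recordId) {
        List<String> names = new List<String>();
        Schema.DescribeSObjectResult describe =
            recordId.getSObjectType().getDescribe();
        for (Schema.ChildRelationship rel : describe.getChildRelationships()) {
            // Some system relationships have no name; skip those.
            if (rel.getRelationshipName() != null) {
                names.add(rel.getRelationshipName());
            }
        }
        return names;
    }
}
```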
Development Challenges
Web workers
Previously, I used the unpkg tool to execute code from a Node module in an LWC. That approach required additional configuration steps and was a less secure way to make it work. This time, I wanted to execute the picoLLM module directly from Salesforce, and not only from an Experience Cloud site but also from the Lightning Experience interface.
Under the hood, picoLLM uses web workers for parallel processing, and that was the main problem: they aren’t allowed to run from an LWC. Luckily, nothing forbids running web workers from a Visualforce page, so that’s the approach I used.
I downloaded the raw picoLLM code and added it to a Visualforce page as a static resource. In the LWC, I used an iframe containing that page. Communication between the LWC and the page inside the iframe let me use web workers: the page triggers the picoLLM-related code on behalf of the Lightning web component.
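A minimal sketch of such a hosting page might look like the following; the static resource name is hypothetical, and the real page also wires the file input and SDK calls shown earlier.

```html
<!-- Sketch of the hosting Visualforce page; the static resource name
     is hypothetical. Web workers are allowed here, unlike in LWC. -->
<apex:page showHeader="false" standardStylesheets="false">
    <!-- Raw picoLLM code uploaded as a static resource -->
    <apex:includeScript value="{!$Resource.picollm}"/>
    <input id="model-file" type="file" style="display: none;"/>
    <script>
        // Payloads arrive from the LWC via postMessage and are routed
        // to the picoLLM code loaded above.
        window.addEventListener("message", function (event) {
            /* handle loadModel / generate payloads here */
        });
    </script>
</apex:page>
```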
Using Salesforce records for context
Copy Salesforce records in JSON or CSV format, paste them into any online LLM, and watch: it will consume the records, use them for extra context, and generate a response. It turned out this is not that easy when using compressed models for local processing.
At first, I simply put the records, in JSON format, right into the user prompt, expecting the model to be smart enough to distinguish the prompt itself from the additional context I provided. I tried different models of varied sizes and couldn’t understand why they weren’t using the JSON to generate responses. I mostly got refusals to respond to my prompt, or fictional data unrelated to what I asked for. I experimented with different formats for the context data: CSV, JSON, prompt dividers to strictly separate the prompt from the context. Nothing helped.
I was ready to abandon the whole idea, since the main feature didn’t work. Then, after a couple of months, a stupidly simple idea struck me: what if I just reversed the order of the prompt parts, so the context comes first and the user prompt second? To my surprise it worked, and every model I used immediately started to treat the Salesforce records as context.
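In code, the fix is as trivial as it sounds. Here is a sketch, with an illustrative helper name:

```javascript
// Sketch: assemble the final prompt with the record context FIRST and
// the user's instruction SECOND; the reverse order kept failing with
// compressed local models. Function name is illustrative.
function buildPrompt(recordsJson, userPrompt) {
  return [
    "Here are Salesforce records to use as context:",
    JSON.stringify(recordsJson, null, 2),
    "Using the records above, respond to the following:",
    userPrompt,
  ].join("\n\n");
}
```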
Performance
The component’s functionality was tested on these machines:
- PC with the AMD Ryzen 9 9900X processor and 32GB of RAM (5600 MT/s).
- Microsoft Surface Laptop 7 powered by the Snapdragon X-Elite ARM processor with 16 GB of RAM (8448 MT/s).
Model loading speed — it’s all about memory
The most time-consuming part of using the component is the initial model loading. You might expect the 9900X to easily outperform the Snapdragon X-Elite, but you’d be wrong. To my surprise, the latter is faster. Since it has faster memory, I presume that the faster your RAM, the faster the model loads.
Response generation speed
It’s the same story with response generation speed: as I understand it, you need a fast combination of CPU and RAM to get the fastest generation possible. Since generation results vary even for the same user prompt, I didn’t benchmark generation speed.
What about using a GPU?
Indeed, using a GPU to generate responses would be much more efficient. While it’s possible to use a GPU with picoLLM, I haven’t tested that configuration myself, for a couple of reasons. First, I believe it uses WebGPU, which isn’t enabled by default in most browsers (except Edge). Second, it likely requires several gigabytes of VRAM to load the model, which my machines don’t have.
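If you want to experiment anyway, you can at least feature-detect WebGPU before opting into GPU inference; `navigator.gpu` and `requestAdapter` are the standard entry points:

```javascript
// Sketch: feature-detect WebGPU before opting into GPU inference.
// navigator.gpu is only defined in browsers where WebGPU is enabled.
async function hasWebGpu() {
  if (!("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}
```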
Conclusion
Developing this assistant has been a fascinating journey of exploration. From grappling with web worker limitations to discovering the crucial role of prompt order in providing context, the challenges have been both stimulating and rewarding. The result is a Lightning Web Component that offers a unique approach to leveraging the power of Large Language Models within the Salesforce ecosystem.
While the initial model loading time can be a consideration, especially for larger models, the ability to process data locally offers significant advantages in terms of data security, responsiveness, and cost-effectiveness. The potential use cases, from automating content generation to providing intelligent assistance, are vast and waiting to be explored.
Check out the GitHub repo.