Meet ClaireBot — a Conversational RAG LLM App with Social Media Context Data
A true vanity project
There is no better way to get hands on with the latest LLM models and tooling than to scope and execute a personal project, so I built a lil personal chatbot named ClaireBot. ClaireBot is a Conversational RAG system augmented by my own social media data to inject my own personality, opinions, and knowledge into the system. A true vanity project.
This blog post will cover the following
- Why I built ClaireBot. Learn the keys to finding a meaningful personal project that enables actual learning and makes your portfolio standout.
- How I built ClaireBot. The steps I took to construct a Conversational LLMs App and what I learned.
Let’s build go it!
How to choose a valuable personal project
So why did I build ClaireBot? With the field of AI advancing so quickly, its important to stay up to date with the latest tools, tech, and best practices, and personal projects are a great way to stay current with emerging LLM technologies. While there are a variety of learning styles, and it’s important to know your own, there is always going to be great value in learning by doing something hands on. I know I can watch training videos and watch other people on youtube code all day, but the truth is that I will not completely grasps the concepts until I try it hands on. Learning by applying the concepts hands on has value across all styles because it is embedding that knowledge in memory.
So what makes a good personal project? Downloading a clean dataset from kaggle or somewhere similar and running prewritten curated code in a notebook is not going to blow a hiring manager’s sock’s off. This kind of low lift low effort project wont even differentiate your resume from the rest in the stack. Why not? These personal projects don’t simulate real life AI projects. As AI practitioner, your responsibilities will require so much more than that. You will need to identify and collect data, participate in project definition and scoping, work with users to understand their needs, choose the model and technical approaches based on your deep knowledge of there use-case and desired user experience, and more! You won’t be able to copy and paste code that works from a tutorial, you’ll have to be all over stack exchange and working with chatGPT to write that code, test it, debug it, and iterate. If you follow a premade tutorial with the data, code, and a project predefined and curated for you, you are skipping the hard parts. And it’s the hard part that forces us to learn and engrain the learnings in our memory.
Even though I do recommend doing personal projects the hard way, I’ll still cut us some slack here — the end result of a personal project does not need to be a perfect product. Remember the goal here is to learn the technologies and get hands on with it. Its ok to cut corners and create tech debt. Just be ready to answer the question in interviews “what would you do if you had more time?” or “what are your next steps to improve this project?”. Prioritizing project steps and tracking tech debt is also a useful real world skill we can learn and demonstrate on a personal project. So decide what you need to build as MVP, and take note of the next steps.
Here are the keys to defining a great personal project:
- Build for somebody. Have a customer in mind so you can really put yourself in the shoes of the user. Keep it fun. The customer could be yourself, a friend, or family member. View them as the customer, and put yourself in the customer’s shoes as you are building and defining the project.
- Understand the use case. The best AI Practitioners understand what they are building. This means understanding the field where you are applying AI, and understand what the end users will want to be doing with your AI system. This is best accomplished in a personal project by choosing to build for yourself or someone you know.
- Design the entire project end-to-end. Yes, you are managing your own project. You will need to define what the AI system does, what data it needs to do it, and how the user will interact with it. It will be a useful system because you are building with a user in mind, and not just following steps laid out for you or applying AI to an easy dataset to because that data was clean and easily available.f
- Identify, collect, and clean your own data. There are a few ways to do this. The key will be to start with the project definition first, and THEN go find the data you need to build it. This data can be collected in a few different ways such as web scraping, open APIs, or free online datasets like kaggle. You may even need to tag your own data. Oh my. Then deep dive into that data to understand the patterns and nuances in it. This will guide the modeling approach and tech you use.
- Scope an MVP. Write out the steps required for your project, then identify the right order to execute them in. Then order these steps based on what makes a MVP and make a cut line on what would be additional added benefit to explore during a next phase.
- Create “tech debt”. Document the tech debt you generate it. These are the things you’d like to do to improve on what you have. It’s ok if your code is not production quality, just know how to get it there if you needed it. (Showcasing production quality code could be another project, and it is the reason I advocate for a github portfolio.)
- Write your own code. Sure, you can and should copy, paste, and use Chatgpt. But in the end, you’ll need to tune that code to get it to work with your unique use case. This is where most of the learning happens. Understand what that code actually does. Read the documentation. You’ll be forced to do this if you’ve defined your own project and have to adapt the code to work for a new use case instead of following a prebuilt tutorial.
Meet ClaireBot!
Thanks for sticking with me this far! Now, take a journey with me through a challenging personal project! Here’s what I built. 🤖
Over the course of this project I had to put on a few different hats and execute steps of the project that might be considered “someone else’s job.” Although we may choose to become an expert at wearing one of these hats, being able to wear multiple hats will prepare us well for a job at a startup where we need to be able and willing to work outside a typical job description to add value at a lean company, or at a larger company who doesn’t have all the tools and data and project definition laid out for us because we’re working an a new and developing field of AI, and they need us to contribute thought and solutions in these area. What I mean to say is, no matter where you work, being able to move gracefully outside your lane will help you contribute and establish yourself as someone who brings solutions not problems — and that is a technical leader. So quit complaining and blaming, and grab your stack of hats, let’s build something.
Here are the AI Autogenerated illustrations of the hats I wore for this project. Wow! What a team!
Project Scoping
(We’ll start by putting on the Product Manager hat 🤠and the End User hat 🤠 as we define and scope what to build.)
What is ClaireBot? ClaireBot is the virtual version of the voice inside my head. It just has better memory than I do. ClaireBot should accurately reflect my personality and knowledge. As the end user, I want ClaireBot to say interesting things that I would say, and ultimately pass the turing test with my family members. This adds value to me as the end user, because I can then automate my conversations with my family — allowing me to spend time with my family while simultaneously working or enjoying life elsewhere at the same time! WOW! What a way to achieve WLB. Does this sound familiar? Is the Rick and Morty reference landing?
To kick off this project, I first took note of technologies I wanted to learn as a result of the project. As part of the project I would explore these and apply them if they were useful. This project covers;
- LLM orchestration tools
- VectorDBs
- Prompt engineering
- Hallucinations
- RAG
And there were some open questions I wanted to address with my work. Such as:
- Can I accomplish this project without incurring the cost of fine tuning? How far can I get without it using context data and prompt engineering? (Answer: Yes and very.)
- Are Hallucinations really a problem, or is that just what statistical inference looks like? (Answer: Depends on your use case)
- Static systems aren’t cool. How do I make this system improve overtime with the data I fed into it? Can this bot learn more about me from interacting with me? Can it grow with me as a person? Scary. (Answer: Update your context data!)
I outlined the execution steps for an MVP. Notice that the steps increase in complexity. The Fine Tuning and RLHF are advanced topics that I can explore if needed but I am not starting with these approaches, but instead I will try to see what I can get working without it and keep them in my back pocket for the next iteration of the project.
MVP Project Steps
Step 1: Make a dumb Chatbot
- Create a conversational chatbot with memory and a simple default prompt.
Step 2: Personalize the Chatbot
- Use Prompt Engineering to craft a great prompt that gives the right instructions.
- Inject Context data.
- Collecting my own social media data through APIs.
- Embed and chunk the context data and put it in a vectorDB.
- Create a retrieval method to pull relevant context data from the vectorDB.
- Inject the prompt with context data.
Step 3: Create the Feedback Loop
- Build in a feedback loop to collect message history data so this system can improve and learn from human interactions.
Step 4: Evaluate the system
- Are responses factual?
- Are there hallucinations? What do these mean in the context of this use case?
- Can ClaireBot pass the turing test?
Advanced improvement option for the next phase of the project are:
- Explore Evals with LLMs.
- Explore Fine tuning as an alternative to the context retrieval
- Explore RLHF to improve the system over time.
Data Collection and Cleaning
( 🤠 Put on your Data Engineer hat)
My goal was to collect data that reflected my personal opinions, personality, and writing style. Here are some ways I downloaded my personal data from social platforms:
- LinkedIn messages and comments (yes, if you’re sliding into my DMs on LI trying to sell me something, our chats are now part of ClaireBot. Careful, you may even be talking to ClaireBot now and not know it!) (link).
- Instagram posts and comments (link).
- Google data including gmail, google keep notes, and my own short stories corpus in google drive (link).
There are two types of knowledge data in Generative AI systems:
- Parametric knowledge — This is the data the model was trained on, and its encoded in the pretrained models weights. Fine tuning would adapt this pretrained model and integrate new parametric knowledge.
- Source knowledge — This is context data injected into the prompt to augment the knowledge with new information not encoded as parametric knowledge. We can add source knowledge though Retrieval Augmented Generation (RAG), where the relevant source data is retrieved and then ingested as context into the prompt template.
We’re going to start by working with source knowledge data. I created a vector database with Chroma, chunked up my personal data, embedded it with OpenAI embeddings, and shoved the embeddings into the vector database.
All this was accomplished easily in LangChain in a jupyter notebook. I’m not sharing my code because 1. It’s messy, 2. I might sell this someday, and 3. y’all need to be writing your own code rather than copy pasting mine so you learn something too. But I’ll link all the docs and blog posts I used as I wrote my code so you can get started with some resources.
I used Arize Phoenix UMAP visualization to analyze the embeddings. Here, you can see natural clusters identified in the data.
What surprised me is how easy it was to get value from fairly unprocessed and uncleaned data. For old school tabular ML models I’d have to spend a lot of time formatting the data, and performing transformations on it to get it to a point where it can be used meaningfully in a model. Instead of preprocessing this data, I added information into the prompt to describe how the data should be interpreted and used.
Creating a ChatBot Chain
(🤠 Put on your Data Scientist hat for the rest of the show.)
LangChain is open source and provide a set of building blocks that allow us to create LLM apps in python. There are different kinds of LLM apps such as Q&A systems, Conversational Bots, etc, and LangChain provides us with tools for doing chaotic stuff like connecting a OpenAI based ChatBot to an open source vectorDB full of your personal data.
I needed my app to function as a Chatbot so I could interact with it conversationally and have it retain memory of the earlier part of that conversation. A Simple Q&A system would not be enough here. LangChain facilitated this okishly with some limitations and workarounds. The Bot I created powered by gpt-3-turbo from OpenAI as the pretrained model under the hood.
Engineering a Prompt
Prompt engineering is how you give a robot it’s purpose. The prompt serves as a set of instructions defining how the bot will interact and respond. It can include guardrails defining what it is allowed and not allowed to do.
I experimented with the prompt quite a bit to force the bot to not break character and to give me fun answers instead of just telling me it’s a bot.
RAG
I wanted my app to rely on relevant social median context data from the vector store to provide meaningful answers. However, usin all the data possible is not reasonable. That will be expensive and confusing. Instead, we need to pull out top 5 most meaningful document chunks and use that. Retrieval Augment Generation, or RAG, is a popular method for retrieving relevant dat and augmenting the Generative AI system with this data that was not in the original training data. We need RAG because the foundational pretrained models are often trained on stale data. Gpt-4, for example, does not include data since January 2022. It has no knowledge of recent events.
Note that the knee jerk reaction to fix this would be fine tuning, where we train the foundational model by feeding in the context data. However, this is expensive and is probably not the right approach for an MVP, although we will explore it and compare fine tuning to RAG in a follow up post.
Also, as I alluded to before, long prompts are ineffective and expensive. If I shove all the data I got as context data into the prompt, the LLM can get confused by the data and may not be able to follow the instructions in the prompt LLM. So instead, I need a fast way to retrieve only relevant information to add to the prompt.
This is done through retrieval from a vector store. I used a cosine similarity retrieval method that returns the top k docs based on embedding similarity scores to ensure the content used is relevant to the query.
Feedback Loop
One of the goals I set out on this project was to build in the ability for the system to learn from interactions with me over time. The Conversational Bot has context of previous chats during a session, but it does not maintain long term memory and update its knowledge over time. To build this in, I simply scooped up the chat history memory at the end of session, embedded it, and shove into our existing vector database next to all my social media data.
Now check this out. ClaireBot is no longer a static system, but is learning in real time!
Evaluation
We’ll explore how to evaluate the system using LLM Evals in a follow up post, but as first pass sanity check for the MVP, we’’re going to think about Hallucinations, and run a Turing Test for fun.
Hallucinations
“Hallucinations” is the term used to refer to the phenomenon when LLMs will generate information that is not accurate or was not present in the training data or prompt. A lot of systems will require only factual results, and hallucinations can be mitigated with monitoring, filling context data gaps, and baking guardrails into the prompt. But that is not what this blog post is about.
I want something more powerful. Hear me out. I actually like the hallucination behavior. When the system hallucinates, it is using the data available to come up with something new — it is make a probabalistic prediction. Instead of regurgitating facts its been fed, or telling you it can’t answer those questions (one of the most annoying responses imho), the hallucination is the models way of inferring a likely response. I want to build a system that can come up with something NEW that is LIKELY based on it’s current knowledge and purpose.
As a guardrail in my prompt, I encourage ClaireBot to hallucinate . I don’t want this robot to regurgitate my knowledge and past conversations, I want it to use that data to infer what I would say in a new scenario. Man, I love conditional probabilities!
Here are some alarming yet fun ClaireBot responses that are probable but not factual.
The Turning Test
Over the next few weeks, ClaireBot will be helping me respond to messages in the family group chat (shhhh this is the Turing Test 🤫) . We’ll keep running this experiment and check back in a few weeks to see if ClaireBot has become a valued member of the family.
What Next for ClaireBot?
I’m going to keep interacting with ClaireBot to allow the system to learn over time. If it becomes sentient, that’d be cool. But I’ll be happy if I managed to create a system that can improve over time automatically.
There are some next steps to improve ClaireBot’s brain and UI that I took note of during the project. I’ll look at Evals, Fine Tuning, and RLHF in a followup blog post.
It needs more data. I could augment the system with more personal data (medium blog posts, short stories, slack history, etc). I’d love to build in automations to pull new social data on a cadence to update the vectorDB automatically.
ClaireBot needs a better interface. I’d love to deploy and host ClaireBot outside the notebook, and maybe even move ClaireBot into an iOS app so I can chat with her all the time when I’m lonely and want validation from the extreme echo chamber I just created.
See, we’re saying “her” now. ClaireBot is alive!
Resources I used to Help Me Build
- https://python.langchain.com/docs/modules/memory/adding_memory_chain_multiple_inputs
- https://medium.com/@rubentak/unleashing-the-power-of-intelligent-chatbots-with-gpt-4-and-vector-databases-a-step-by-step-8027e2ce9e78
- https://bdtechtalks.com/2023/07/10/llm-fine-tuning/
- https://www.datacamp.com/tutorial/building-context-aware-chatbots-leveraging-langchain-framework-for-chatgpt
- https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html
- https://medium.com/@rubentak/unleashing-the-power-of-intelligent-chatbots-with-gpt-4-and-vector-databases-a-step-by-step-8027e2ce9e78
- https://towardsdatascience.com/all-you-need-to-know-about-vector-databases-and-how-to-use-them-to-augment-your-llm-apps-596f39adfedb
- https://docs.arize.com/arize/model-types/large-language-models-llm