How to get started? A number of questions ....

Hi there. If I want to use an LLM the same way I use Stable Diffusion, i.e. running it on my own PC with pre-trained models (checkpoints?), where would I start?

If I would want to have access to it on my mobile devices - is this a possibility?

If I would then later want to create workflows using these AI tools - say use the LLM to generate prompts and automatically run them on Stable Diffusion - is this a possibility?

I'm consistently frustrated with ChatGPT seemingly not being able to remember a chat history past a certain point. Would a self-run model be better in that regard (i.e. will I be able to reference something in a chat thread that happened 2 weeks ago)?

Are there tools that would allow cross-thread referencing?

I have no expert knowledge whatsoever, but I don't shy away from spending hours learning new stuff. Will I be able to take steps working towards my own personal AI assistant? Or would this be way out of scope for a hobbyist?

7 comments

Depends on your hardware and how far you're willing to go. For serious development I think you need at least 12-16 GB of VRAM, but there are still some things you can do with ~8. If you just have a CPU, you can still test some models but generation will be slow.
I'd recommend trying out the oobabooga text-generation-webui. This should work with quite a few models on Hugging Face. Hopefully I don't get in trouble for recommending a subreddit, but r/LocalLLaMA has a lot of other great resources and is a very active community. They're doing exactly what you want.
As far as your other questions...
Accessing it on your phone is going to be tricky. You would most likely want to host it somewhere, but I'm not sure how easy that is for someone without a bit of software background. Maybe there is a good service for this; Hugging Face might offer something.
Cross-thread referencing is an interesting idea. I think you would need to create a log store of all your conversations and then embed those into a vector store (like Milvus, Weaviate, or Qdrant). This is a little tricky since you have to decide how to chunk your conversations, but it is doable. The next step is somewhat open-ended. You could always query your vector store with any question that you are already sending to your model, and then pass any hits to the model along with your original question. Alternatively, you could tell the model to check for other conversations and trigger a function call to do this on command. A good starting point might be this example, which makes references to a hardware manual in a Q&A style chatbot.
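To make the "always query the store, then pass the hits along" variant a bit more concrete, here is a minimal sketch. Everything in it is a stand-in I made up for illustration: `embed` is a toy bag-of-words counter instead of a real embedding model, `ToyVectorStore` is an in-memory list instead of Milvus/Weaviate/Qdrant, and the chunking rule (a fixed number of messages per chunk) is just one possible choice.

```python
import math
import re
from collections import Counter


def chunk_messages(messages, size=3):
    """Group consecutive messages into chunks of `size` messages each."""
    return [" ".join(messages[i:i + size]) for i in range(0, len(messages), size)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory list standing in for a real vector database (assumption)."""

    def __init__(self):
        self.items = []  # tuples of (thread_id, chunk_text, vector)

    def add_thread(self, thread_id, messages, size=3):
        for chunk in chunk_messages(messages, size):
            self.items.append((thread_id, chunk, embed(chunk)))

    def search(self, query, top_k=2):
        q = embed(query)
        scored = [(cosine(q, vec), tid, chunk) for tid, chunk, vec in self.items]
        scored = [s for s in scored if s[0] > 0]  # drop chunks with no overlap
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:top_k]


def build_prompt(store, question, top_k=2):
    """Prepend the best-matching chunks from older threads to the question."""
    hits = store.search(question, top_k)
    context = "\n".join(f"[{tid}] {chunk}" for _, tid, chunk in hits)
    return f"Relevant earlier conversations:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore()
store.add_thread("thread-1", ["Mira is a smuggler from the port city of Vel.",
                              "She owes money to the harbor guild."])
store.add_thread("thread-2", ["The kingdom's capital sits on a frozen lake.",
                              "Winters last nine months."])
print(build_prompt(store, "Who does Mira owe money to?", top_k=1))
```

With a real setup you would swap `embed` for an actual embedding model and `ToyVectorStore` for one of the databases above; the flow (chunk, embed, search, prepend hits) should stay roughly the same.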
Using an LLM with Stable Diffusion: I'm not entirely sure what you are hoping to get out of this. Maybe to reduce boilerplate prompt writing? But yes, you can fine-tune a model to handle this and then have the model execute a function that calls Stable Diffusion and returns the results. I am pretty sure LangChain provides a framework for this. LangChain is almost certainly a tool you will want to become familiar with.
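For what "have the model execute a function" could look like, here is a rough sketch in plain Python rather than LangChain. Both ends are fakes I invented for illustration: `fake_llm` stands in for a local model that has been told to answer with a JSON tool call when asked for a picture, and `generate_image` stands in for whatever Stable Diffusion API you end up using. Only the dispatch logic in the middle is the point.

```python
import json


def fake_llm(user_message):
    """Hypothetical stand-in for a local LLM instructed to emit a JSON
    tool call whenever the user asks for a picture."""
    text = user_message.lower()
    if "drawing" in text or "picture" in text:
        return json.dumps({
            "tool": "generate_image",
            "prompt": "portrait of a weathered smuggler, harbor at dusk, detailed illustration",
        })
    return "Let's keep developing the story."


def generate_image(prompt):
    """Hypothetical stand-in for a Stable Diffusion call."""
    return f"<image generated from: {prompt}>"


# Registry of functions the model is allowed to trigger.
TOOLS = {"generate_image": generate_image}


def handle_reply(reply):
    """Run a tool if the reply is a JSON tool call; otherwise return the text unchanged."""
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        return reply  # ordinary chat text
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return reply
    return TOOLS[call["tool"]](call.get("prompt", ""))


print(handle_reply(fake_llm("Now give me a drawing of the character we just introduced")))
print(handle_reply(fake_llm("What happens after she leaves the harbor?")))
```

Frameworks like LangChain wrap this same pattern (model output is parsed, a registered function is called, the result is returned), so it is worth understanding the bare version first.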
- Thank you for the input! I recently upgraded my PC to be able to handle Stable Diffusion, and I got 12GB of VRAM to work with at the moment. I also have recently started to self-host some applications on a VPS, so some basics are there.
  As for what I'd like to do with Stable Diffusion: One of my hobbies is storytelling and worldbuilding. I would like to (one day) be able to work on a story with an LLM and then prompt it: "now give me a drawing of the character we just introduced to the story" and the LLM would automagically rope in Stable Diffusion and produce a workable drawing with it. I think that this is probably beyond the capability of the current tools, but this is what I would like to achieve. I will definitely look into LangChain to see what I can do with it.
  That's also where the questions about context length and cross-thread referencing come from. I did some work with ChatGPT and am amazed at how good a tool it is to "brainstorm with myself" in developing stories. However, it does not remember the story bits I was working on 2 hours ago, which kinda bummed me out .. :)
- Thanks for the input!
  I recently built a new PC to handle Stable Diffusion, which gives me 12GB of VRAM to work with. I also started to self-host a few things on a VPS recently, so I have a bit of a basis there.
  As for Stable Diffusion integration: I do storytelling/worldbuilding as a hobby and find LLMs to be an amazing tool to "brainstorm with myself". It would be amazing if I could tell the LLM to "make a picture of the new character" and it would connect to and prompt SD accordingly. I assume that this is out of scope of what's currently possible, but something like that would be my goal. I will certainly have a look at LangChain as you proposed. That's also the context of me asking about cross-referencing and context length. I've been working with ChatGPT, and while it is an amazing tool, it had me bummed out when it couldn't reference a character that was developed a couple of hours earlier (even in the same thread). The cross-referencing solution that you sketched above might work for me, but I guess it'll take a while to learn how to do it.
  Given this as a bit of context: where should I start? Download Llama 2, as another reply suggests, and go from there?
  
  Seems reasonable. I'll add in that there are models specifically finetuned for storytelling. You might check out this thread for some other model suggestions. I think you will also likely want to find a framework for RLHF.
You should download Llama 2 from Meta, as that is the best open source LLM right now. It comes in 7B, 13B, and 70B sizes, as well as chat versions of those sizes. You'll need a good computer to run them, but if you're already running Stable Diffusion you should be fine.
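As a rough way to judge which of those sizes fits your hardware, you can estimate the memory the weights alone take: parameter count times bits per weight, divided by 8. The sketch below does only that arithmetic. The 8-bit and 4-bit rows assume you use a quantized version of the model (not something stated above), and the numbers ignore the context cache and other runtime overhead, so treat them as a lower bound.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the context (KV) cache, activations, and framework overhead.
    """
    return params_billions * bits_per_weight / 8


for size in (7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{size}B at {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

By this estimate the 7B model at 16-bit needs about 14 GB for weights, while 13B at 4-bit needs about 6.5 GB, which is roughly why quantized 13B models are a common choice on 12 GB cards.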
I think Llama 2 has a Python API, so you should be able to use its output as the prompt for SD, as long as SD also has a Python API.
Llama 2 actually has a smaller context length than ChatGPT (it will remember less of the conversation), but you can use hacks like using a separate prompt to summarise the conversation, then another one to find the parts of it that are relevant to your actual prompt, and then finally use that selected part of the conversation history in your prompt.
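The three-prompt hack can be sketched like this. The `llm` argument is any function that takes a prompt string and returns the model's reply; `RecordingLLM` is a fake I wrote so the example runs without a model, and the prompt wording is only an illustration, not a tested recipe.

```python
def answer_with_long_history(llm, history, question):
    """Chain three prompts: summarise, select relevant notes, then answer."""
    # Step 1: compress the whole conversation into short notes.
    summary = llm(
        "Summarise this conversation as short bullet points:\n" + "\n".join(history)
    )
    # Step 2: keep only the notes that matter for the current question.
    relevant = llm(
        "From these notes, copy only the lines relevant to the question.\n"
        f"Notes:\n{summary}\nQuestion: {question}"
    )
    # Step 3: answer using the selected notes as context.
    return llm(f"Context:\n{relevant}\n\nUser: {question}")


class RecordingLLM:
    """Fake model for demonstration: stores each prompt and returns a numbered reply."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"<reply {len(self.prompts)}>"


llm = RecordingLLM()
history = ["User: Mira is a smuggler.", "Assistant: Noted. Where does she live?"]
print(answer_with_long_history(llm, history, "What is Mira's job?"))
print(len(llm.prompts), "prompts were sent")
```

The trade-off is three model calls per message instead of one, so it is slower, but each individual prompt stays within the context limit.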
- I have a decent CPU and a GPU with 12GB of VRAM - this should let me run the 7B at least, from what I have seen in the sticky post.
  Besides downloading the model, what kind of UI should I start with? Are there any good tutorials that you are aware of?
  
  If you're using llama.cpp, it can split the work between GPU and CPU, which allows you to run larger models if you sacrifice a little bit of speed. I also have 12 GB of VRAM and I'm mostly playing around with llama-2-13b-chat. llama.cpp is more of a library than a program, but it does come with a simple terminal program to test things out. However, many GUI/web programs use llama.cpp, so I expect them to be able to do the same.
  As for GUI programs, I've seen gpt4all, kobold and SillyTavern, but I never got any of them to run in Docker with GPU acceleration.
