Ask HN: Best LLM model for a RAG-based Android app across all smartphones?

I am developing a RAG-based Android app using llama.cpp. For offline inference I use the Qwen 1.5 2.5B model with Q4 quantization, offloading computation to the GPU when one is present. On low-end Android phones, however, the app either crashes with an OOM error or, when no GPU is available, takes a very long time to generate text. I also tried the SmolLM 135M model for low-end devices, but it struggles to follow instructions well.
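
One pattern for this situation is a device-tiering check at startup: route low-RAM devices to the cloud API and keep on-device inference for capable phones. Here is a minimal sketch in Java; the 4 GiB threshold and the `Backend` names are my own illustrative assumptions (on a real device, `totalRamBytes` would come from `ActivityManager.MemoryInfo.totalMem`), not anything from the post:

```java
public class BackendSelector {
    public enum Backend { LOCAL_GPU, LOCAL_CPU, CLOUD_API }

    // Assumed floor for running a ~1.5-2.5B Q4 model on-device; tune per model.
    static final long MIN_RAM_FOR_LOCAL = 4L * 1024 * 1024 * 1024; // 4 GiB

    // totalRamBytes would come from ActivityManager.MemoryInfo on Android;
    // hasGpu from whatever GPU-capability probe the app already does.
    public static Backend choose(long totalRamBytes, boolean hasGpu) {
        if (totalRamBytes < MIN_RAM_FOR_LOCAL) {
            return Backend.CLOUD_API;      // low-end: avoid OOM, fall back to API
        }
        return hasGpu ? Backend.LOCAL_GPU  // offload layers to GPU
                      : Backend.LOCAL_CPU; // CPU-only llama.cpp
    }
}
```

The same check could also pick a smaller local model instead of the API on mid-range devices, if offline operation is a hard requirement.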

Given this, I am considering falling back to the OpenAI API on low-end phones. For vector storage I am using an in-house library: https://github.com/hash-anu/snkv
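
For context, the retrieval step itself is cheap even on low-end devices compared to generation. Below is a generic brute-force top-k retrieval sketch in Java, ranking stored chunk embeddings by cosine similarity against a query embedding. This is not snkv's actual API (which I have not inspected); it only illustrates the RAG retrieval step:

```java
import java.util.Arrays;

public class Retriever {
    // Cosine similarity between two embedding vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
    }

    // Return the indices of the k stored vectors most similar to the query.
    static int[] topK(float[][] store, float[] query, int k) {
        Integer[] idx = new Integer[store.length];
        for (int i = 0; i < store.length; i++) idx[i] = i;
        Arrays.sort(idx, (x, y) ->
            Double.compare(cosine(store[y], query), cosine(store[x], query)));
        int[] out = new int[Math.min(k, store.length)];
        for (int i = 0; i < out.length; i++) out[i] = idx[i];
        return out;
    }
}
```

Since retrieval stays fast on-device, a hybrid design (local vector search plus a cloud API for generation) only sends the selected chunks over the network.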

I am not sure how other people run LLM models on low-end Android devices, so I would appreciate any insights or best practices.

1 point | by swaminarayan 2 hours ago

0 comments