Run Local LLMs on iPhone or Mac Easily Using Private LLM

In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) have become powerful tools for a wide range of applications. However, concerns about privacy, security, and internet connectivity have left many users seeking a more secure and flexible solution. Private LLM is an innovative app that addresses these concerns by allowing users to run LLMs directly on their iPhone, iPad, and Mac, providing a secure, offline, and customizable on-device chatbot experience without an API key.

Private LLM offers a host of features that set it apart from other AI chatbot apps:

  1. Offline Functionality: Ensures privacy and security by processing data locally on your device, without an internet connection.
  2. Support for Open-Source AI Models: Provides access to a wide range of models, including Llama 3, Phi-3, Google Gemma, Mistral 7B, Yi-6B, and many more.
  3. Integration with iOS and macOS Features: Seamlessly integrates with Siri, Apple Shortcuts, and macOS services, allowing users to create powerful AI-driven workflows.
  4. One-Time Purchase: Provides a cost-effective solution with a single purchase price for all Apple devices, including Family Sharing for up to six family members. Users can download new models for free, with no subscription required. If you're not satisfied, you can request a refund through the App Store.
  5. Uncensored Models: Includes models like Dolphin 2.9 Llama 3 8B 🐬, Dolphin 2.6 Phi-2 🐬, and TinyDolphin 2.8 1.1B Chat 🐬 that provide an uncensored AI chatbot experience, ideal for NSFW topics and roleplay scenarios.
  6. Superior Model Performance with OmniQuant Quantization: Private LLM outperforms other local AI solutions through its advanced model quantization technique, OmniQuant. Unlike the simple Round-To-Nearest (RTN) quantization used by competing apps, OmniQuant is an optimization-based method that uses learnable weight clipping. This allows more precise control over the quantization range, preserving the integrity of the original weight distribution. As a result, Private LLM achieves superior on-device model performance and accuracy, nearly matching an unquantized 16-bit floating-point (fp16) model while requiring significantly less computation at inference time. Additionally, unlike almost every other competing offline LLM app, Private LLM isn't based on llama.cpp. Advanced features missing from llama.cpp (and, by extension, from apps built on it), such as attention sinks and sliding window attention in Mistral models, are therefore available in Private LLM but unavailable elsewhere.
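The intuition behind clipped quantization can be illustrated with a small toy in plain Python. OmniQuant's actual procedure learns the clipping parameters through gradient-based optimization on calibration data, which is not shown here; this sketch only demonstrates why clipping the quantization range helps when a few outlier weights would otherwise stretch an RTN grid (the fixed clip factor below is an arbitrary illustrative value, not what OmniQuant computes):

```python
import random

def rtn_quantize(w, bits=4):
    """Round-To-Nearest: the quantization grid spans the full min..max range."""
    qmax = 2 ** bits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / qmax
    return [round((x - lo) / scale) * scale + lo for x in w]

def clipped_quantize(w, bits=4, clip=0.3):
    """Clip the range before quantizing: outliers saturate, but the grid
    becomes much finer for the bulk of the weight distribution.
    (OmniQuant learns the clipping; 'clip' here is a fixed toy value.)"""
    qmax = 2 ** bits - 1
    lo, hi = min(w) * clip, max(w) * clip
    scale = (hi - lo) / qmax
    return [min(max(round((x - lo) / scale), 0), qmax) * scale + lo for x in w]

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10_000)]
for i in range(10):
    w[i] *= 8  # a few outlier weights stretch the full min..max range

mse = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
err_rtn = mse(w, rtn_quantize(w))
err_clip = mse(w, clipped_quantize(w))
print(f"RTN MSE: {err_rtn:.4f}  clipped MSE: {err_clip:.4f}")
```

Because the handful of outliers force RTN to spread its 16 levels over a very wide range, the clipped variant reconstructs the typical weight far more accurately, at the cost of saturating the rare outliers.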

Getting Started with Private LLM for iOS and macOS

Private LLM is the best way to run on-device LLM inference on Apple devices, from the latest models to older ones. The app comes with built-in models that work well even on older devices, ensuring that all users can enjoy the benefits of local GPT. Users can download Private LLM directly from the App Store.

Screenshot of Phi-3 Mini 4K Instruct Running Locally on iPhone
Phi-3 Mini 4K Instruct Running on iPhone
Screenshot of Mistral 7B Instruct Running on Mac
Mistral 7B Instruct Running on Mac

Understanding System Prompts

A system prompt in an LLM is a set of instructions or guidelines that help steer the model's output in a desired direction. It allows users to tailor the AI's behavior for specific tasks or scenarios.

System Prompt Example on Private LLM for iPad
System Prompt Example on Private LLM for iPad

For example, if you wanted to use Private LLM as a creative writing assistant, you could provide a system prompt like: "You are an imaginative story writer. Expand on the user's writing prompts to create vivid, detailed scenes that draw the reader in. Focus on sensory details, character development, and plot progression."

Or if you needed help studying for a history exam, your prompt might be: "Act as an expert history tutor. When the user asks about a historical event or period, provide a concise summary of the key facts, dates, and figures. Then ask follow-up questions to test their understanding and help reinforce the material."

The image displays the system prompt screen of the Private LLM app on an iPad.
The intuitive system prompt screen lets you engage in dynamic conversations with your very own secure, on-device AI assistant

Common Use Cases for System Prompts Include:

  • Roleplaying, where the prompt sets the character and context for the AI to adopt
  • Task-oriented prompts that instruct the AI to perform actions like summarization, translation, code generation, or data analysis
  • Persona prompts to give the AI a consistent personality and communication style
  • Creative prompts for brainstorming ideas, writing poetry or jokes, or generating artwork descriptions

The possibilities are endless. By crafting targeted system prompts, users can get the most out of LLMs and adapt them to their unique needs. Experimenting with different prompts is a great way to explore the on-device capabilities of models like those in Private LLM.

Understanding Sampling Settings

Private LLM provides users with control over the sampling settings used by the AI model during text generation. The two main settings are Temperature and Top-P.

Temperature

Temperature is a value that ranges from 0 to 1 and controls the randomness of the model's output. A lower temperature (closer to 0) makes the output more focused and deterministic, which is ideal for analytical tasks or answering multiple-choice questions. A higher temperature (closer to 1) introduces more randomness, making the output more diverse and creative. This is better suited for open-ended, generative tasks like story writing or brainstorming.

For example, if you're using Private LLM to generate a poem, you might set the temperature to around 0.8 to encourage the model to come up with novel and imaginative lines. On the other hand, if you're asking the AI to solve a math problem, a temperature of 0.2 would ensure it stays focused on finding the correct answer.
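The effect of temperature is easy to see by applying it to a toy set of logits (the raw scores a model assigns to candidate tokens) before the softmax step. This is a generic sketch of how temperature scaling works in LLM samplers, not Private LLM's internal code:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with softmax.
    As temperature approaches 0, probability concentrates on the top token;
    higher temperatures flatten the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # toy scores for four candidate tokens

for t in (0.2, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

At temperature 0.2 nearly all of the probability mass lands on the highest-scoring token, while at 0.8 the alternatives keep a meaningful share, which is exactly why low temperatures feel deterministic and high temperatures feel creative.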

Screenshot showing Private LLM's sampling setting on iPhone
Private LLM's Sampling Settings on iPhone

Top-P

Top-P, also known as nucleus sampling, is another way to control the model's output. Instead of considering every candidate token (word or subword), the model samples only from the smallest set of most probable tokens whose cumulative probability reaches the specified Top-P value. A Top-P of 1 means the model will consider all possible tokens, while a lower value like 0.5 restricts it to the most likely tokens that together make up 50% of the probability mass.

Top-P can be used in combination with temperature to fine-tune the model's behavior. For instance, if you're generating a news article, you might set the temperature to 0.6 and Top-P to 0.9. This would ensure the output is coherent and relevant (by limiting the randomness with temperature) while still allowing for some variation in word choice (by considering a larger pool of probable tokens with Top-P).
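The nucleus-sampling rule described above can be sketched in a few lines. Given a toy next-token distribution, the filter keeps the smallest top slice of tokens whose probabilities sum to at least Top-P, then renormalizes before sampling (a generic illustration, not Private LLM's implementation):

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize so they sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over five candidate tokens
probs = [0.45, 0.25, 0.15, 0.10, 0.05]

print(top_p_filter(probs, 0.9))  # keeps tokens 0-3 (cumulative 0.95 >= 0.9)
print(top_p_filter(probs, 0.5))  # keeps tokens 0-1 (cumulative 0.70 >= 0.5)
```

Note how a lower Top-P prunes the long tail of unlikely tokens entirely, which is why it keeps output on-topic without making it fully deterministic.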

Meta Llama 3 70B Running Locally on MacBook Pro
Meta Llama 3 70B Running Locally on MacBook Pro

In general, it's recommended to adjust either temperature or Top-P, but not both at the same time. Experiment with different values to find the sweet spot for your specific use case. Keep in mind that the optimal settings may vary depending on the model and the type of task you're performing.

Choosing the Right AI Model for iPhone and iPad

When selecting a model in Private LLM for your iPhone or iPad, there are options to suit various devices and needs. For more powerful iPhones and iPads, models like Llama 3 8B or Phi-3 Mini are solid picks. For older devices, models like H2O Danube or TinyLlama may work better. If you prefer an uncensored model, consider Eric Hartford's Dolphin series, such as Dolphin 2.9 Llama 3 8B 🐬 or Dolphin 2.6 Phi-2 🐬. Think about your device's capabilities and what you plan to use the model for to select one that will work well for on-device inference.

Screenshot of Meta Llama 3 8B Instruct Running on iPad
Meta Llama 3 8B Instruct Running on iPad

Choosing the Right AI Model for Mac

When choosing a model in Private LLM for your Mac, there are various options to cater to different configurations and requirements. For the most powerful Apple Silicon Macs with 48GB or more of RAM, we recommend the Meta Llama 3 70B Instruct model. This 4-bit OmniQuant-quantized version of Meta's model is part of the Llama 3 family of large language models and has been optimized for dialogue use cases.

Meta Llama 3 70B Running Locally on MacBook Pro
Meta Llama 3 70B Running Locally on MacBook Pro

For Macs with 32GB or more of RAM, the Mixtral 8x7B Based Models are a great choice. If you need bilingual English-Chinese support and have 24GB or more of RAM, Yi 34B is the way to go (for 8GB Macs, Yi-6B is suggested). Older Intel-based Macs will benefit from Meta Llama 3 8B, Google Gemma, StableLM 3B, or any of the Mistral 7B Based Models.

For users interested in uncensored NSFW content and AI roleplay conversations, Private LLM offers Dolphin 2.9 Llama 3 8B 🐬, Dolphin 2.6 Mixtral 8x7b 🐬, Dolphin 2.6 Phi-2 🐬, and Dolphin 2.1 Mistral 7B. Creative writers with 16GB Macs should consider Nous Hermes 2 - Solar 10.7B for their needs.

Explore a wide range of open-source LLM models supported by Private LLM. Each model has unique characteristics, such as context length, performance, and uncensored capabilities, so you can choose the one that best fits your on-device needs.

Understanding Context Length

Context length refers to the maximum amount of text that an LLM can process in a single input. It's important to choose a model with the appropriate context length for your needs.

iPhone or iPad

For iOS devices, if you plan to feed large amounts of text into Private LLM, opt for models like Llama 3 8B or Google Gemma, which offer a full 8K context. This allows the AI to understand and respond to longer passages, making it ideal for tasks like document analysis or extended conversations on your device. For older iOS devices, H2O Danube 1.8B provides a 4K context, while TinyLlama 1.1B offers a 2K context for more lightweight applications.

Mac

On Mac, the Mistral Instruct v0.2, Nous Hermes 2 Mistral 7B DPO, and BioMistral 7B models now support a full 32K context. This extended context length enables the AI to process and respond to even longer passages, making it perfect for tasks involving large documents, complex analyses, or in-depth discussions. For users who require a balance between context length and performance, models like Llama 3 8B or Google Gemma, with their 8K context, remain excellent choices.
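When deciding whether a document will fit in a model's context window, a common rule of thumb is that one token corresponds to roughly four characters of English text. The exact ratio depends on the tokenizer, so the sketch below is only a rough estimate, and the `reserve` margin for the system prompt and the model's reply is an illustrative value:

```python
def fits_context(text, context_tokens, chars_per_token=4, reserve=512):
    """Rough estimate of whether 'text' fits in a context window.
    Assumes ~4 characters per token (tokenizer-dependent); 'reserve'
    leaves room for the system prompt and the model's response."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_tokens - reserve

document = "word " * 6000  # ~30,000 characters, roughly 7,500 tokens

print(fits_context(document, 8_192))  # should fit an 8K-context model
print(fits_context(document, 4_096))  # too long for a 4K context
```

For text that exceeds the window, the usual options are to pick a longer-context model (such as the 32K-context models above) or to split the input into chunks.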

Downloading New AI Models

To download new models in Private LLM, navigate to the app's settings and select the desired model. Keep in mind that model sizes vary, typically ranging from a few hundred megabytes to several gigabytes.

Due to iOS limitations, you must keep the app open during the download process on your iPhone or iPad, as background downloads are not supported. Additionally, users can only download one model at a time on iOS devices. These limitations do not apply to Mac.

Be patient, as larger models may take some time to download fully. Remember, with Private LLM's one-time purchase, you can download new models for free forever, with no subscription required.

Downloadable LLM Models on Private LLM for iPad
Downloadable LLM Models on Private LLM for iPad
Downloadable LLM Models on Private LLM for Mac
Downloadable LLM Models on Private LLM for Mac

Integrating with iOS and macOS Features and Custom Workflows

Private LLM's integration with Apple Shortcuts is one of its most powerful features. Users can automate tasks and create custom workflows by combining Private LLM with the built-in Shortcuts app. This integration lets users incorporate on-device AI into their daily routines and develop personalized, efficient solutions for various tasks.

Screenshot of iPhone showcasing shortcuts created with Private LLM
Shortcuts created with Private LLM on iPhone
Screenshot of Mac showcasing shortcuts created with Private LLM
Shortcuts created with Private LLM on Mac

By crafting specific prompts and integrating them with Apple Shortcuts, users can guide the AI model to produce desired outputs. This technique, known as prompt engineering, enables the creation of powerful workflows tailored to individual needs. For example, the "Ask Llama 3" shortcut demonstrates a simple way to interact with a private on-device AI. Users can dictate their query in English, which is processed by the "Dictate Text" action and sent to Private LLM via the "Start a new chat with Private LLM" action. The AI's response can be displayed instantly or copied to the clipboard for later use. This user-friendly shortcut showcases the seamless integration of voice commands and AI chatbot functionality, making it an excellent introduction to the potential of combining these technologies.

Example of Ask Llama 3 Shortcut on iPhone
Ask Llama 3 Shortcut Example on iPhone
Screenshot of Shortcuts called unChatGPTfy created using Private LLM
unChatGPTfy Shortcuts created using Private LLM on Mac

With Private LLM's support for various models, users can experiment with different prompts and settings to find the perfect combination for their needs, creating workflows such as generating meeting summaries from voice recordings, translating text between languages, creating personalized workout plans, and analyzing emails for better organization. The possibilities are vast, and users can unlock the full potential of on-device AI in their daily lives through Private LLM and Apple Shortcuts.

Additionally, Private LLM supports the popular x-callback-url specification, which over 70 popular iOS and macOS applications also support, so Private LLM can seamlessly add on-device AI functionality to those apps. Furthermore, Private LLM integrates with macOS services to offer grammar and spelling checking, rephrasing, and shortening of text in any app via the macOS context menu.
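The general shape of an x-callback-url request, as defined by the x-callback-url specification, can be assembled programmatically. Note that the scheme name `privatellm` and the action name `chat` below are hypothetical placeholders for illustration; consult the app's documentation for the actual scheme, actions, and parameters:

```python
from urllib.parse import urlencode

def build_x_callback_url(scheme, action, params, x_success=None):
    """Assemble a URL following the x-callback-url convention:
    scheme://x-callback-url/action?param=...&x-success=..."""
    query = dict(params)
    if x_success:
        query["x-success"] = x_success  # URL the target app opens on success
    return f"{scheme}://x-callback-url/{action}?{urlencode(query)}"

# Hypothetical example; the real scheme and action names may differ.
url = build_x_callback_url(
    scheme="privatellm",
    action="chat",
    params={"prompt": "Summarize this note"},
    x_success="myapp://result",
)
print(url)
```

The `x-success` parameter is how the calling app gets the result back, which is what makes chains of app-to-app automation possible under this convention.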

Screenshot showing the Private LLM integration within the macOS system-wide services menu.
Private LLM integration within the macOS system-wide services menu

Uncensored AI Models

Private LLM provides access to uncensored on-device AI models like Dolphin 2.9 Llama 3 8B 🐬, Dolphin 2.6 Mixtral 8x7b 🐬, and Dolphin 2.6 Phi-2 🐬 for users who want to explore topics that other chatbots might block, including NSFW content or imaginative roleplaying. These unrestricted models open up possibilities for freeform, open-ended conversations. However, it's crucial that users approach this capability thoughtfully and use it ethically. With great conversational freedom comes great responsibility. Always keep in mind the potential impact of your interactions, even with an on-device AI system, and strive to use this technology in a way that is respectful, constructive, and in line with your values.

Screenshot of Dolphin 2.9 Llama 3 8B Uncensored running on iPhone
Dolphin 2.9 Llama 3 8B Uncensored running on iPhone
Screenshot of Dolphin 2.6 Mixtral 8x7b Uncensored running on Mac
Dolphin 2.6 Mixtral 8x7b Uncensored running on Mac

Download Private LLM to Run LLMs Locally on iPhone, iPad, and Mac

Private LLM is the best way to run on-device LLM inference on Apple devices, providing a secure, offline, and customizable experience without an API key. With its support for over 30 models, seamless integration with iOS and macOS features, and the ability to create powerful custom workflows through prompt engineering, Private LLM empowers users to harness the full potential of on-device AI. Whether you're using the built-in model on an older device or exploring uncensored conversations with Dolphin models for NSFW content and AI roleplay, Private LLM offers an unparalleled on-device GPT experience.

Download Private LLM on the App Store
Stay connected with Private LLM! Follow us on X for the latest updates, tips, and news. Want to chat with fellow users, share ideas, or get help? Join our vibrant community on Discord to be part of the conversation.