Running Gemma 4 Locally with Ollama on Your PC

Open-weight fashions are driving the newest pleasure within the AI panorama. Operating highly effective fashions domestically improves privateness, cuts prices, and allows offline use. However the open-source fashions are far and few! However Google‘s Gemma 4 is right here to vary that!

This information walks by way of what Gemma 4 is, would explores its variants, and descriptions the {hardware} wanted for its efficiency. You’ll additionally see tips on how to take a look at your setup and construct a Second Mind AI venture powered by Google’s Gemma 4. We’ll additionally use Claude Code CLI to streamline growth and combine workflows.

Understanding Gemma 4

Gemma is Google’s household of open-weight language fashions, and Gemma 4 marks a major step ahead. It brings stronger reasoning, higher effectivity, and broader multimodal help, dealing with not simply textual content but additionally photographs, with some variants extending to audio and video. The fashions are designed to run domestically, making them sensible for privacy-sensitive and offline use circumstances.

Learn extra: Gemma 4: Arms-On

Gemma 4 Variants

There are 4 totally different Gemma 4 variants. These embrace E2B, E4B, 26B A4B, and 31B. The E2B and E4B are abbreviated to the which means of efficient parameters. These fashions are applicable to edge gadgets. The 26B A4B is predicated on Combination-of-Consultants (MoE) structure. The Dense structure is used within the 31B.

Mannequin	Efficient/Lively Params	Whole Params	Structure	Context Window
E2B	2.3B efficient	5.1B with embeddings	Dense + PLE	128K tokens
E4B	4.5B efficient	8B with embeddings	Dense + PLE	128K tokens
26B-A4B	3.8B lively	25.2B complete	Combination-of-Consultants (MoE)	256K tokens
31B	30.7B lively	30.7B complete	Dense Transformer	256K tokens

The MoE construction permits effectiveness. Solely explicit professionals come into play over some activity. This renders greater fashions to be manageable. The Dense structure employs all of the parameters. All of the Gemma 4 variations have their very own benefits.

Setting Up Gemma 4 on Your PC with Ollama

Ollama offers a easy method. It assists within the straightforward working of native LLMs. Ollama is user-friendly. Its set up is easy. It manages fashions effectively. Ollama 4 is domestically obtainable in Gemma 4 with Ollama.

Set up Information

Set up Ollama in your PC. Set up the appliance utilizing the Ollama official website. Drag the appliance to your Functions. Open Ollama from there. It operates in your menu bar.

Then obtain Gemma 4 fashions. Open your terminal. Enter the ollama pull command. Point out the suitable tags.

For E2B: ollama pull gemma4:e2b
For E4B: ollama pull gemma4:e4b
For 26B A4B: ollama pull gemma4:26b
For 31B: ollama pull gemma4:31b

This fetches the mannequin information. You now have Gemma 4 domestically with Ollama.

{Hardware} Configuration

Have in mind the {hardware} of your PC. Gemma 4 variants have various wants.

In E2B and E4B: These fashions are appropriate with many of the fashionable laptops. They want a minimal of 8GB of RAM. In accordance with a latest survey, 75 % of the builders have 16GB RAM or greater. Such variants are applicable.
Within the case of 26B A4B: Extra sources are required on this mannequin. It makes use of about 16GB or above of VRAM. That is applicable to the high-end laptops or workstations.
Within the case of 31B: Probably the most resource-intensive variant is that this one. It requires 24GB or above of VRAM. That is the power of Apple Silicon Macs (M1/M2/M3/M4). These fashions benefit from the benefit of getting a standard reminiscence construction.

Operating and Testing the Mannequin

Run the mannequin out of your terminal. Use the ollama run command.

ollama run gemma4:e2b (Substitute e2b along with your chosen variant).

The mannequin will load. You may then enter prompts.

Instance Prompts:

Textual content Era: “Write a brief poem concerning the ocean.”
Coding Query: “Clarify tips on how to kind an inventory in Python.”
Reasoning/Summarization: “Summarize the important thing factors of local weather change in two sentences.”

Observe the response instances. The larger fashions are slower. It’s straightforward to work together with Gemma 4 domestically with Ollama.

Arms-on Undertaking Growth with Claude Code CLI and Gemma 4

We’re going to create a Second Mind that’s powered by AI. This venture offers solutions to your native information. It additionally summarizes paperwork. This half demonstrates its growth. Claude Code CLI can be our coding assistant. Notably, we will set Claude Code CLI to work with Gemma 4 domestically and Ollama as its massive language mannequin. This renders our complete growth and venture native and personal.

Establishing Claude Code CLI to make use of Gemma 4

Claude Code CLI is an agentic coding software. It operates immediately in your terminal. It helps with code technology, debugging, and refactoring.

Set up:
Claude Code CLI works on macOS (10.15+), Linux, and Home windows (10+ through WSL/Git Bash). It wants a minimal of 4GB RAM. 8GB or extra is healthier.

For macOS and Linux, the really helpful native installer is:

curl -fsSL  | bash

For macOS customers, Homebrew is an possibility:

brew set up --cask claude-code

Connecting Claude Code CLI to Gemma 4 through Ollama:
After putting in Claude Code CLI and pulling your required gemma4 mannequin with Ollama, you’ll be able to launch Claude Code CLI, instructing it to make use of Gemma 4:

ollama launch claude --model gemma4:e4b (Substitute e4b along with your chosen variant).

This command tells Claude Code CLI to direct its LLM requests to your native Ollama occasion, particularly utilizing the gemma4 mannequin you will have pulled. No Anthropic API key’s wanted when working on this totally native setup.

Arms-on Steps to Construct the “Second Mind”

We use Claude Code CLI to put in writing Python code. This code then interacts with Gemma 4 domestically with Ollama.

1. Undertaking Initialization & Construction with Claude Code CLI
Open your terminal. Navigate to your required venture listing. Guarantee Claude Code CLI is lively utilizing ollama launch claude --model gemma4:26b (Substitute e4b along with your chosen variant) command.

I’m utilizing gemma4:26b domestically with none cloud help, let’s see the way it goes.

Immediate Claude Code CLI to create the fundamental construction:

“Generate a Python venture construction for a "Second Mind" utility. Embrace directories for 'information', 'scripts', 'vector_store', and a major 'app.py' file.”

Generating a project structure on Claude code

It gave me a reply that no construction was created.

2. Doc Loader & Chunker Script (utilizing Claude Code CLI)
Now, we want a script to course of paperwork.

“Write a Python script in 'scripts/data_processor.py'. This script ought to use 'langchain_community.document_loaders' (particularly 'PyMuPDFLoader' for PDFs and 'TextLoader' for TXT) and 'langchain.text_splitter.RecursiveCharacterTextSplitter'. It hundreds paperwork from the 'information' listing. Chunks are 1000 characters with 100 overlap. Every chunk should retain its authentic supply and web page metadata. Make sure that the script handles a number of file sorts and returns the processed chunks as an inventory of LangChain 'Doc' objects.“

3. Embedding & Vector Retailer Script (utilizing Claude Code CLI)
Subsequent, we generate embeddings and save them.

“Create a Python script in 'scripts/vector_db_manager.py'. This script ought to take an inventory of LangChain 'Doc' objects. It generates embeddings utilizing the Ollama embedding mannequin ('OllamaEmbeddings' from 'langchain_community.embeddings', mannequin 'gemma4:e2b'). Then, it persists them right into a ChromaDB occasion within the 'vector_store' listing. It should even have a operate to load an present ChromaDB.”

4. RAG Question Operate (utilizing Claude Code CLI)
Now, for the core question-answering.

“Develop a Python operate in 'app.py' known as 'query_second_brain(query_text: str)'. This operate hundreds the ChromaDB. It retrieves the highest 3 related chunks. It then makes use of 'langchain_openai.ChatOpenAI' (configured for Ollama's API: 'base_url=" 'mannequin="gemma4:e2b"') to reply 'query_text' utilizing the retrieved chunks as context. Use a transparent RAG immediate construction. Present the total operate.”

5. Summarization Operate (utilizing Claude Code CLI)

Lastly, a summarization characteristic.

“Add a Python operate to 'app.py' known as 'summarize_document(file_path: str)'. This operate ought to load the doc, move its content material to the native Gemma 4 mannequin through Ollama, and return a concise abstract. Use an acceptable immediate for summarization.”

By means of every step, Claude Code CLI, powered by Gemma 4, generated the code. Gemma 4 itself carried out the core AI duties of the venture.

NOTE: On working the ultimate code that’s python app.py as prompt by Gemma4, I bumped into an error.

I attempted to repair it, offering the precise error for a number of iterations however this native mannequin was not capable of repair the code, the claude code broke a number of time by simply offering the abstract however no modifications.

In a single immediate it simply gave us the code content material requested to create a file by yourself.

We then determined to modify to the cloud model of gemma4:31b which is offered in Ollama cloud in a free tier. Simply use this command

ollama launch claude --model gemma4:31b-cloud

It should immediate you to check in on the browser simply do it and you might be able to code.

We gave it a easy immediate

❯ analyse the @second_brain/ venture and make a full plan to make the venture useful

Then the Gemma 4 31b cloud mannequin analysed the total venture, corrected each code. It took virtually 7 minutes to do that work but it surely accomplished every damaged code and verified the total working of the app.

Testing the app

On opening the app appears like this.

We uploaded a pattern textual content file and ran the ingestion pipeline.

Ingesting documents within the application

Now, let’s take a look at the chat characteristic utilizing the native Gemma4 mannequin and ollama native endpoint for answering:

The second brain responding about the document that we had provided

After quite a few iterations, I consider that working fashions domestically and utilizing them for code technology with a well-liked software like Claude Code nonetheless has a protracted method to go. Whereas native LLMs working on private PCs are promising, they face vital constraints concerning {hardware} necessities, inference latency, and intelligence limitations. In the end, to get advanced work completed effectively, we needed to swap again to cloud-deployed fashions.

Conclusion

From set up to exploring its totally different variants, it’s clear this mannequin household is constructed for sensible, real-world use. Operating it domestically offers you management over information, reduces dependency on exterior APIs, and opens the door to constructing quicker, extra personal workflows.

The Second Mind venture highlights what’s doable if you mix Gemma 4 with instruments like Claude Code CLI. This hybrid setup blends sturdy reasoning with environment friendly growth, making it simpler to construct clever programs that work in manufacturing.

Continuously Requested Questions

Q1. What are the important thing benefits of working Gemma 4 in Ollama?

A. When used domestically with Ollama, Gemma 4 ensures privateness of information, lowers API charges, and provides offline entry to sturdy AI companies.

Q2. What’s the most applicable Gemma 4 model to me on my Mac?

A. Gemma 4 variation is essentially the most appropriate variant relying on the RAM of your Mac. E2B/E4B go well with 8GB+ RAM. 26B/31B variants want 16GB-24GB+ VRAM.

Q3. What’s a Second Mind powered by AI?

A. A private information system is an AI-powered Second Mind. It solutions questions and summarizes native paperwork utilizing native LLMs.

Harsh Mishra is an AI/ML Engineer who spends extra time speaking to Massive Language Fashions than precise people. Enthusiastic about GenAI, NLP, and making machines smarter (in order that they don’t exchange him simply but). When not optimizing fashions, he’s most likely optimizing his espresso consumption. 🚀☕