Is it possible to run on only a CPU? #38
Comments
At the moment it's not possible via pipeline.py, but you can do it if you just infer the model directly. See: https://huggingface.co/allenai/olmOCR-7B-0225-preview The model card has a code sample showing how to call the model, which will work (slowly) on CPU. But you lose the advantages of the pipeline.
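For anyone landing here, a minimal sketch of what "infer the model directly" on CPU can look like with the transformers API. The float32 dtype is an assumption for CPUs without bfloat16 support, and the prompt/image preparation is elided; see the model card or the full script further down in this thread.

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

device = torch.device("cpu")  # force CPU even if a GPU is present

# Load the checkpoint in full precision; bfloat16 also works on CPUs that support it
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-7B-0225-preview", torch_dtype=torch.float32
).eval().to(device)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# ...build the chat prompt and image exactly as in the model card, then:
# output = model.generate(**inputs, max_new_tokens=..., do_sample=False)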
Hm, then I shall wait. Thanks for the detailed response! :) Edit: testing this running on CPU only on my Mac M1 Pro with 16 GB RAM right now.
Confirmed to work on CPU through the script you pointed me to! :D (took a while tho lol) Sadly the output appears truncated, so something may have gone wrong looking at it...

(base) drew@wmughal-CN4D09397T test % python test.py
Loading checkpoint shards: 100%|████████████████████████████| 4/4 [00:00<00:00, 6.16it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
['{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\\nOpen Weights and Open Data\\nfor State-of-the']
(base) drew@wmughal-CN4D09397T test %
Running this modified script:

import torch
import base64
import urllib.request
import json
import time
from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text
# Start time tracking
start_time = time.time()
# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Grab a sample PDF
pdf_path = "./paper.pdf"
urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", pdf_path)
# Render page 1 to an image
image_base64 = render_pdf_to_base64png(pdf_path, 1, target_longest_image_dim=1024)
# Build the prompt using document metadata
anchor_text = get_anchor_text(pdf_path, 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)
# Build the full prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]
# Apply the chat template and processor
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))
# Prepare inputs for model
inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}
# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=200,  # Increased to avoid truncation
    num_return_sequences=1,
    do_sample=True,
)
# Decode the output
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
# End time tracking
end_time = time.time()
processing_time = end_time - start_time # Time taken for execution
# Save output to text file
output_text_path = "output.txt"
with open(output_text_path, "w", encoding="utf-8") as f:
    f.write(text_output[0])  # Save the first element as text
# Try saving output as JSON if possible
output_json_path = "output.json"
try:
    parsed_output = json.loads(text_output[0])  # Try parsing as JSON
    with open(output_json_path, "w", encoding="utf-8") as f:
        json.dump(parsed_output, f, indent=4)
    print(f"Output successfully saved as JSON: {output_json_path}")
except json.JSONDecodeError:
    print("Output is not valid JSON, saved as plain text.")
# Print output & processing time
print("\nGenerated Output:\n", text_output[0])
print(f"\nProcessing Time: {processing_time:.2f} seconds")
# Confirm file saving
print(f"\nOutput saved to {output_text_path} and {output_json_path}") |
Testing result:

(base) drew@wmughal-CN4D09397T test % python test.py
Loading checkpoint shards: 100%|████████████████████████████| 4/4 [00:00<00:00, 5.95it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Output is not valid JSON, saved as plain text.
Generated Output:
{"primary_language":"en","is_rotation_valid":true,"rotation_correction":0,"is_table":false,"is_diagram":false,"natural_text":"Molmo and PixMo:\nOpen Weights and Open Data\nfor State-of-the-Art Multimodal Models\n\nMatt Deitke∗†ψ Christopher Clark∗† Sangho Lee† Rohun Tripathi† Yue Yang†\nJae Sung Parkψ Mohammadreza Salehiψ Niklas Muennighoff† Kyle Lo† Luca Soldaini†\nJiasen Lu† Taira Anderson† Erin Bransom† Kiana Ehsani† Huong Ngo†\nYenSung Chen† Ajay Patel† Mark Yatskar† Chris Callison-Burch† Andrew Head†\nRose Hendrix† Favyen Bastani† Eli VanderBilt† Nathan Lambert† Yvonne Chou†\nArnavi Chheda† Jenna Sparks† Sam
Processing Time: 3249.81 seconds
Output saved to output.txt and output.json
(base) drew@wmughal-CN4D09397T test %
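A note on the truncation: generation stops after max_new_tokens=200 tokens, which is far fewer than a full page of the paper contains, so the JSON is cut off mid-value and fails to parse. A hedged tweak to the generate call (the exact token budget is a guess, and a larger budget will make CPU runs even slower):

output = model.generate(
    **inputs,
    do_sample=False,          # greedy decoding; sampling isn't needed for OCR
    max_new_tokens=3000,      # rough guess for a dense page; raise if still truncated
    num_return_sequences=1,
)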
Almost an hour to process a page, yikes!
Yup, and it didn't even generate the text of the full thing; I only got about a paragraph out of the model. Perhaps it can be quantized or something and run with llama.cpp, but I don't know if it's a vision model or not, so 🤷
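On the quantization idea: it is a vision-language model (a Qwen2-VL fine-tune, as the script above shows), so llama.cpp support would depend on its multimodal/GGUF tooling. One CPU-only option that stays in PyTorch is dynamic int8 quantization of the Linear layers; a sketch only, untested with this model and with unknown quality impact:

import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "allenai/olmOCR-7B-0225-preview", torch_dtype=torch.float32
).eval()

# Swap nn.Linear weights for int8; activations stay in float.
# Whether the vision tower tolerates this is an open question.
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)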
I know in the readme it says:
"
Requirements:
"
But it also says:
"Install sglang with flashinfer if you want to run inference on GPU."
Does that imply that it can be run on a CPU only (albeit a bit slow)?
Thanks! :)