Fix: page_chars attribute does not exist in some formats of PDF #3796

cyhasuka · 2024-12-02T07:01:27Z

What problem does this PR solve?

In #3335, someone suggested upgrading to pdfplumber==0.11.1, but that did not solve the problem.
The actual trigger is the special formatting in some PDFs.

Type of change

Bug Fix (non-breaking change which fixes an issue)

KevinHuSh · 2024-12-02T11:03:33Z

deepdoc/parser/pdf_parser.py

@@ -956,8 +956,12 @@ def __images__(self, fnm, zoomin=3, page_from=0,
                                enumerate(self.pdf.pages[page_from:page_to])]
            self.page_images_x2 = [p.to_image(resolution=72 * zoomin * 2).annotated for i, p in
                                enumerate(self.pdf.pages[page_from:page_to])]
-            self.page_chars = [[{**c, 'top': c['top'], 'bottom': c['bottom']} for c in page.dedupe_chars().chars if self._has_color(c)] for page in
-                               self.pdf.pages[page_from:page_to]]
+            try:


I don't get it. Why is this try nested inside another try?

The inner try catches exceptions raised while building page_chars, so that a parsing problem on individual pages does not make parsing of the entire PDF fail. In the original code, an exception raised while building page_chars was not handled at that point; it was simply reported as an error and parsing stopped.

Much appreciated!
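
For readers following the thread, the fallback described in the reply above could be sketched roughly as follows. This is an illustration, not the merged change: `extract_page_chars` and its `has_color` parameter are hypothetical names, and the per-page granularity is an assumption. The only pdfplumber API it relies on is `page.dedupe_chars()` and the resulting `.chars` list.

```python
import logging


def extract_page_chars(pages, has_color=lambda c: True):
    """Collect character dicts page by page.

    Hypothetical helper: a page whose characters cannot be extracted
    contributes an empty list instead of aborting the whole document.
    """
    page_chars = []
    for index, page in enumerate(pages):
        try:
            # dedupe_chars() returns a page-like object exposing .chars
            chars = [dict(c) for c in page.dedupe_chars().chars if has_color(c)]
        except Exception as exc:
            # Log and continue so later pages are still processed.
            logging.warning("Failed to extract chars on page %d: %s", index, exc)
            chars = []
        page_chars.append(chars)
    return page_chars
```

Keeping one entry per page, even an empty one, means the result stays index-aligned with the page images, so downstream code that looks up characters by page number does not need a separate existence check.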

cyhasuka added 9 commits November 18, 2024 16:04

Fix: description err

be47ca8

Fix: description err

ab324b1

Fix: description err

dd2da7d

Fix: description err

0417fe8

Fix: description err

c5b99ed

Merge branch 'infiniflow:main' into main

abeddb4

Merge branch 'infiniflow:main' into main

70805b7

Merge branch 'infiniflow:main' into main

488725b

Fix: Pdf object has no attribute page_chars

e3b50c2

KevinHuSh reviewed Dec 2, 2024


KevinHuSh merged commit 7b6a5ff into infiniflow:main Dec 3, 2024
2 checks passed


cyhasuka commented Dec 2, 2024

KevinHuSh Dec 2, 2024

cyhasuka Dec 3, 2024

KevinHuSh Dec 3, 2024
