This repository has been archived by the owner on Jun 26, 2020. It is now read-only.

Incomplete abstract #23

Open
nleguillarme opened this issue Jul 31, 2019 · 6 comments

Comments

@nleguillarme

While iterating over the articles returned by a PubMed query, I also noticed that the abstract is sometimes incomplete.

For instance:
Query: ((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))

Returns (when printing the first 10 results):
pubmed_id = '31015971'
abstract = 'Bald eagle ('

mbullmanFHCRC mentioned this issue Aug 9, 2019
@Keramatfar

The problem is that the package gets the text of the AbstractText tag, and in this example that tag contains HTML tags. See here:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=31015971
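
The truncation can be reproduced without the package. This is a minimal sketch, assuming the abstract is read through the `.text` attribute of an `xml.etree.ElementTree` element (an assumption about the package internals, not something confirmed in this thread). The XML fragment is a shortened, hand-written stand-in for the real record, and the text after the species name is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the AbstractText element of a PubMed record.
SAMPLE = (
    "<AbstractText>Bald eagle (<i>Haliaeetus leucocephalus</i>) "
    "diet study.</AbstractText>"
)

element = ET.fromstring(SAMPLE)

# .text only holds the text that precedes the first child element,
# so everything from "<i>" onwards is lost.
truncated = element.text

# itertext() walks the element and all of its children in document order,
# so inline markup no longer cuts the text short.
complete = "".join(element.itertext())

print(repr(truncated))
print(repr(complete))
```

Collecting the text with `itertext()` is an alternative to stripping the tags before parsing. It does not depend on knowing which inline tags may appear.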

@Keramatfar

I changed line 156 in api.py to:
response = re.sub('<[/ ]*[a-z]{1,3}>', '', str(response.text))
return response
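
The substitution above can be tried on a small fragment to see what it removes and what it leaves alone. This is a sketch only: `strip_short_tags` is a hypothetical helper name, and the sample strings are hand-written rather than taken from a real record.

```python
import re

# Same pattern as in the comment above: an opening or closing tag whose
# name is one to three lowercase letters and which has no attributes.
TAG_PATTERN = re.compile(r"<[/ ]*[a-z]{1,3}>")


def strip_short_tags(xml_text: str) -> str:
    """Remove short inline tags such as <i>, <sub> and <sup> from raw XML."""
    return TAG_PATTERN.sub("", xml_text)


raw = "<AbstractText>H<sub>2</sub>O uptake in <i>Rhodiola</i></AbstractText>"
cleaned = strip_short_tags(raw)

# Structural tags start with an uppercase letter, so they are kept.
print(cleaned)

# Namespaced tags contain a colon and therefore do not match the pattern.
math = "<mml:mi>x</mml:mi>"
print(strip_short_tags(math))
```

Because the pattern only accepts lowercase names of up to three letters, the capitalized structural tags of the record survive and the XML can still be parsed afterwards.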

@iacopy

iacopy commented Mar 15, 2020

Many abstracts and titles are truncated.

iacopy added a commit to iacopy/pymed that referenced this issue Mar 22, 2020
as well as multiple ids issues.
Closes gijswobben#23

In some cases the title and/or abstract obtained was incomplete.

Example: PMID 31689885
Title tag:
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Result was: 'Gamma Irradiated ' (now is 'Gamma Irradiated Rhodiola sachalinensis Extract[...]')
Abstract tag:
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Result was: 'The effect of '

Also, this record returned a series of ids instead of just one.
iacopy added a commit to iacopy/pymed that referenced this issue Mar 22, 2020
as well as multiple ids issues.
Closes gijswobben#23

In some cases the title and/or abstract obtained was incomplete,
and wrong ids were returned instead of the paper pubmed id.

Example: PMID 31689885
Title tag:
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Result was: 'Gamma Irradiated ' (now is 'Gamma Irradiated Rhodiola sachalinensis Extract[...]')
Abstract tag:
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Result was: 'The effect of '

Also, this record returned a series of ids instead of just one.

Solution: cleanup of html markup tags like <i> and <sub>.
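
One plausible explanation for the series of IDs, stated here as an assumption rather than something confirmed in this thread: a record can contain further `PMID` elements for related articles, so a search over all descendants returns every one of them. This is a minimal sketch with a hand-written, heavily simplified record. The second ID is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written record: the article's own PMID plus the PMID
# of a related article nested deeper in the tree (illustrative values).
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>31689885</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections>
        <PMID>11111111</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation>
</PubmedArticle>
"""

record = ET.fromstring(SAMPLE)

# ".//PMID" matches PMID elements at any depth, so related IDs come along.
all_ids = [node.text for node in record.findall(".//PMID")]

# An explicit path from the record root selects only the article's own ID.
own_id = record.findtext("./MedlineCitation/PMID")

print(all_ids)
print(own_id)
```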
iacopy added a commit to iacopy/pymed that referenced this issue Mar 22, 2020
Closes gijswobben#23

In some cases the title and/or abstract obtained was incomplete.

Example: PMID 31689885
Title tag:
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Result was: 'Gamma Irradiated ' (now is 'Gamma Irradiated Rhodiola sachalinensis Extract[...]')
Abstract tag:
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Result was: 'The effect of '

Solution: cleanup of html markup tags such as <i>, <sub>, <sup>.
iacopy added a commit to iacopy/pymed that referenced this issue Mar 22, 2020
In some cases the title and/or abstract obtained was incomplete.

This happens when the text contains html markup tags
(<i>, <sub>, <sup>, ...).

Example: PMID 31689885
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Before the fix the returned title was just: 'Gamma Irradiated '
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Before the fix the returned abstract was just: 'The effect of '

Fastest solution found: cleanup of the frequently used HTML markup tags <i>, <sub>, <sup>.
It seems to fix gijswobben#23 correctly, at least for the above-mentioned tags.
@iacopy

iacopy commented Mar 22, 2020

I changed the line 156 in api.py to:
response = re.sub('<[/ ]*[a-z]{1,3}>', '', str(response.text))
return response

The change quoted above seems a useful solution for most articles,
but it does not work for math articles that contain `<mml:math ...>` tags.
In any case, I suggest merging it so that we have a significant fix for now.
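
For the `<mml:math ...>` case, one option is to drop the markup after parsing instead of before it. This is a minimal sketch of a different technique from the regex substitution, assuming that collapsing all runs of whitespace into single spaces is acceptable. `flatten_text` is a hypothetical helper, and the sample element is hand-written.

```python
import xml.etree.ElementTree as ET


def flatten_text(element: ET.Element) -> str:
    """Return the text of an element and all its descendants on one line."""
    # itertext() yields every text fragment regardless of the tag it sits in,
    # including fragments inside namespaced tags such as mml:math.
    text = "".join(element.itertext())
    # split() without arguments splits on any whitespace run, which also
    # removes the blank lines left behind by nested tags.
    return " ".join(text.split())


SAMPLE = """<AbstractText xmlns:mml="http://www.w3.org/1998/Math/MathML">Energy
<mml:math>
  <mml:mi>E</mml:mi>
</mml:math>
was measured.</AbstractText>"""

flattened = flatten_text(ET.fromstring(SAMPLE))
print(flattened)
```

The trade-off is that the structure of a formula is lost and only its text fragments remain, which may or may not be readable depending on the formula.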

iacopy added a commit to iacopy/pymed that referenced this issue Mar 23, 2020
In some cases the title and/or abstract obtained was incomplete.

This happens when the text contains html markup tags.
The most frequent ones seem to be (in descending order):
<i>, <sub>, <sup>, <b>, <mml:*>, ...?.

Example: PMID 31689885
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Before the fix the returned title was just: 'Gamma Irradiated '
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Before the fix the returned abstract was just: 'The effect of '

Fastest solution found: cleanup of tags.
It seems to fix gijswobben#23 correctly, at least for non-mml tags.
NB: cleaning of nested <mml:*> tags can result in multiple blank lines.
iacopy added a commit to iacopy/pymed that referenced this issue Mar 26, 2020
iacopy added a commit to iacopy/pymed that referenced this issue Apr 12, 2020
In some cases the title and/or abstract obtained was incomplete
(issue gijswobben#23).

This happens when the text contains html markup tags.
The most frequent ones seem to be (in descending order):
<i>, <sub>, <sup>, <b>, <mml:*>, ...?.

Example: PMID 31689885
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Before the fix the returned title was just: 'Gamma Irradiated '
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Before the fix the returned abstract was just: 'The effect of '

Fastest solution found: cleanup of tags.
It seems to fix gijswobben#23 correctly, at least for non-mml tags.
NB: cleaning of nested <mml:*> tags can result in multiple blank lines.
iacopy added a commit to iacopy/pymed that referenced this issue Apr 12, 2020
In some cases the title and/or abstract obtained was incomplete
(issue gijswobben#23).

This happens when the text contains html markup tags
(<b>, <i>, <sub>, <sup>, ...).

Example: PMID 31689885
<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>
Before the fix the returned title was just: 'Gamma Irradiated '
<AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>
Before the fix the returned abstract was just: 'The effect of '

Fastest solution found: cleanup of frequently used HTML markup tags like <b>, <i>, <sub>, <sup>.
It seems to fix most papers correctly, at least for the above-mentioned tags.
@vectorkt

@iacopy

This issue still occurs. I installed pymed through pip as suggested here:

https://pypi.org/project/pymed/

Is the pip package up to date? Should I clone the git directly instead?

Or is this issue not fixed overall?

@iacopy

iacopy commented Apr 29, 2020

@vectorkt yes, the pip package is not up to date: these merge requests have not been merged, so the issue still occurs. This repo currently seems abandoned.
If you want some fixes (correct PMIDs, non-truncated texts, English-only abstracts, ...), you can use the fork-fixes branch of my fork. In your virtualenv, try pip install -e git://github.com/iacopy/pymed.git@fork-fixes#egg=pymed, preceded by pip install requests if needed. I'm currently using this myself.
Let me know.

iacopy added a commit to iacopy/pymed that referenced this issue Feb 25, 2022