Incomplete abstract #23
Comments
The problem is that the package takes the text of the AbstractText tag, and in the example that tag contains HTML tags. See here:
I changed line 156 in api.py to:
Many abstracts and titles are truncated.
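The truncation reported in this thread can be reproduced without the package. The sketch below assumes the response is parsed with Python's standard `xml.etree.ElementTree` and that the field is read through the element's `.text` attribute. Neither is shown in this thread, so treat both as assumptions. The sample XML is the title fragment quoted in this thread for PMID 31689885, with its `[...]` elision left in place.

```python
import xml.etree.ElementTree as ET

# Title fragment quoted in this thread (PMID 31689885), elision kept as-is.
raw = (
    "<ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> "
    "Extract Ameliorates [...]</ArticleTitle>"
)
title = ET.fromstring(raw)

# .text holds only the characters before the first child element (<i>),
# which is why the result stops right before the italicized species name.
truncated = title.text

# itertext() walks the element and its descendants and yields every text
# fragment in document order, including the text that follows each child.
full = "".join(title.itertext())

print(repr(truncated))
print(repr(full))
```

If the package does read `.text`, switching that read to `"".join(element.itertext())` would be one way to keep the text inside and after inline tags without listing the tags one by one.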
as well as multiple ids issues. Closes gijswobben#23. In some cases the title and/or abstract obtained was incomplete. Example: PMID 31689885. Title tag: <ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>. Result was: 'Gamma Irradiated ' (it is now 'Gamma Irradiated Rhodiola sachalinensis Extract[...]'). Abstract tag: <AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>. Result was: 'The effect of '. Also, this record returned a series of ids instead of just one.
as well as multiple ids issues. Closes gijswobben#23. In some cases the title and/or abstract obtained was incomplete, and wrong ids were returned instead of the paper's PubMed id. Example: PMID 31689885. Title tag: <ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>. Result was: 'Gamma Irradiated ' (it is now 'Gamma Irradiated Rhodiola sachalinensis Extract[...]'). Abstract tag: <AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>. Result was: 'The effect of '. Also, this record returned a series of ids instead of just one. Solution: clean up HTML markup tags like <i> and <sub>.
Closes gijswobben#23. In some cases the title and/or abstract obtained was incomplete. Example: PMID 31689885. Title tag: <ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>. Result was: 'Gamma Irradiated ' (it is now 'Gamma Irradiated Rhodiola sachalinensis Extract[...]'). Abstract tag: <AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>. Result was: 'The effect of '. Solution: clean up HTML markup tags such as <i>, <sub>, <sup>.
In some cases the title and/or abstract obtained was incomplete. This happens when the text contains HTML markup tags (<i>, <sub>, <sup>, ...). Example: PMID 31689885. <ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>: before the fix, the returned title was just 'Gamma Irradiated '. <AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>: before the fix, the returned abstract was just 'The effect of '. Fastest solution found: clean up the frequently used HTML markup tags <i>, <sub>, <sup>. It seems to fix gijswobben#23 correctly, at least for the above-mentioned tags.
It seems to be a useful solution for most articles.
In some cases the title and/or abstract obtained was incomplete. This happens when the text contains HTML markup tags. The most frequent ones seem to be (in descending order): <i>, <sub>, <sup>, <b>, <mml:*>, ...?. Example: PMID 31689885. <ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>: before the fix, the returned title was just 'Gamma Irradiated '. <AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>: before the fix, the returned abstract was just 'The effect of '. Fastest solution found: cleanup of tags. It seems to fix gijswobben#23 correctly, at least for non-mml tags. NB: cleaning nested <mml:*> tags can result in multiple blank lines.
In some cases the title and/or abstract obtained was incomplete (issue gijswobben#23). This happens when the text contains HTML markup tags. The most frequent ones seem to be (in descending order): <i>, <sub>, <sup>, <b>, <mml:*>, ...?. Example: PMID 31689885. <ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>: before the fix, the returned title was just 'Gamma Irradiated '. <AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>: before the fix, the returned abstract was just 'The effect of '. Fastest solution found: cleanup of tags. It seems to fix gijswobben#23 correctly, at least for non-mml tags. NB: cleaning nested <mml:*> tags can result in multiple blank lines.
In some cases the title and/or abstract obtained was incomplete (issue gijswobben#23). This happens when the text contains HTML markup tags (<b>, <i>, <sub>, <sup>, ...). Example: PMID 31689885. <ArticleTitle>Gamma Irradiated <i>Rhodiola sachalinensis</i> Extract Ameliorates [...]</ArticleTitle>: before the fix, the returned title was just 'Gamma Irradiated '. <AbstractText>The effect of <i>Rhodiola sachalinensis</i> Boriss extract irradiated [...]</AbstractText>: before the fix, the returned abstract was just 'The effect of '. Fastest solution found: clean up frequently used HTML markup tags like <b>, <i>, <sub>, <sup>. It seems to fix most papers correctly, at least for the above-mentioned tags.
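The tag cleanup described in the commit messages above could look roughly like the following. This is a minimal sketch, not the code from those commits: the helper names `strip_inline_tags` and `collapse_whitespace` are hypothetical, the tag list covers only the four non-mml tags named above, and the whitespace step is just one possible way to deal with the blank lines mentioned in the NB.

```python
import re
import xml.etree.ElementTree as ET

# Inline formatting tags named in the commit messages (mml tags not covered).
INLINE_TAGS = ("i", "b", "sub", "sup")

# Matches the opening and closing forms, e.g. <i>, </i>, <sub>, </sub>.
_INLINE_TAG_RE = re.compile(r"</?(?:%s)>" % "|".join(INLINE_TAGS))


def strip_inline_tags(raw_xml: str) -> str:
    """Hypothetical helper: drop inline tags from raw XML, keep their text."""
    return _INLINE_TAG_RE.sub("", raw_xml)


def collapse_whitespace(text: str) -> str:
    """Hypothetical helper: squeeze whitespace runs, including blank lines."""
    return " ".join(text.split())


# Abstract fragment quoted in this thread (PMID 31689885), elision kept as-is.
raw = (
    "<AbstractText>The effect of <i>Rhodiola sachalinensis</i> "
    "Boriss extract irradiated [...]</AbstractText>"
)

# With the inline tags removed before parsing, .text covers the whole string.
element = ET.fromstring(strip_inline_tags(raw))
abstract = collapse_whitespace(element.text)
print(abstract)
```

A fixed tag list only handles the tags it names, which matches the caveat in the commit messages that the fix is known to work "at least for" the listed tags.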
This issue still occurs. I installed pymed through pip as suggested here: https://pypi.org/project/pymed/ Is the pip package up to date? Should I clone the Git repository directly instead? Or is this issue not fixed at all?
@vectorkt yes, the pip package has not been updated, since these merge requests have not been merged, so the issue still occurs. This repo currently seems to be abandoned.
While iterating over the articles returned by a PubMed query, I also noticed that the abstract is sometimes incomplete.
For instance:
Query: ((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))
Returns (when printing the first 10 results):
pubmed_id = '31015971'
abstract = 'Bald eagle ('