Email with HTML to Text parsing

(6 posts) (2 voices)

  1. duane, Member

    Hi.

    I've noticed that the HTML-to-text parsing for the plain text part of emails often drops links completely, doesn't convert bullet lists well, and has other problems. What does myDBR use to parse HTML to text?

    reStructuredText is a fairly good approach. There is a parser for PHP at https://github.com/doctrine/rst-parser/ and an older one at https://github.com/Gregwar/RST (there may be others, and there is also Pandoc).

    Is there a chance to implement HTML to reStructuredText conversion to make the plain text version of the email better? (When going into some systems, the plain text version of the HTML email is almost unreadable and missing key links.)

  2. myDBR Team, Key Master

    Just to be clear: you are sending HTML mail from myDBR (including links and bullet lists), and the plain text version of the mail does not show up well enough?

    --
    myDBR Team

  3. duane, Member

    Correct: sending HTML email from myDBR (latest version).

    Bullet lists get collapsed to:
    Line 1
    Line 2
    Line 3

    which makes them hard to read compared with RST, which would be something like:
    - Line 1
    - Line 2
    - Line 3
    OR
    * Line 1
    * Line 2
    * Line 3

    An HTML link (but more complex in real life) like <a href="https://mydomain.com/myurl">My link text</a> gets reduced to:
    My link text. (with no url)

    in RST it might be something like: `My link text <https://mydomain.com/myurl>`_. (there are other variants too)

    Paragraphs blend together instead of keeping their line spacing, so something like:
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
    (which usually displays with an extra carriage return between paragraphs)

    Would become:
    Paragraph 1
    Paragraph 2

    Plus, <br> line breaks get removed instead of being converted to a single line return, so
    Beforebreak
    Afterbreak
    becomes
    BeforebreakAfterbreak

    I definitely know HTML to plain text parsing will never be perfect, and there are workarounds I can add (\n) to help. But since RST exists and seems to have PHP/server implementations, I thought I'd suggest it ;-)

    If you are wondering who even sees plain text: processing systems often do. I use mailman3 and send reports to a list. The HTML email goes through to all recipients set to receive email, but the archive stores the plain text version.
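
    To make the four problems above concrete, here is a minimal sketch of a converter that keeps bullets, links, paragraph spacing, and line breaks. It is written in Python with only the standard library and is purely illustrative: it is not what myDBR or PHPMailer does, the names HtmlToRstText and html_to_text are made up for this example, and it handles only the tags mentioned in this post (ordered lists are rendered with "- " bullets as well).

```python
import re
from html.parser import HTMLParser


class HtmlToRstText(HTMLParser):
    """Hypothetical helper: HTML to RST-flavoured plain text."""

    def __init__(self):
        super().__init__()
        self.out = []          # output fragments, joined at the end
        self.in_link = False   # True while inside an <a> element
        self.href = None       # href of the <a> currently open, if any
        self.link_text = []    # text collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.out.append("\n\n")      # blank line before a paragraph
        elif tag == "br":
            self.out.append("\n")        # single line return
        elif tag == "li":
            self.out.append("\n- ")      # RST bullet item
        elif tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_endtag(self, tag):
        if tag in ("p", "ul", "ol"):
            self.out.append("\n\n")      # blank line after a block
        elif tag == "a" and self.in_link:
            text = "".join(self.link_text).strip()
            if self.href:
                # RST inline hyperlink: `text <url>`_
                self.out.append(f"`{text} <{self.href}>`_")
            else:
                self.out.append(text)
            self.in_link = False
            self.href = None

    def handle_data(self, data):
        # Collapse whitespace runs the way a browser would.
        text = re.sub(r"\s+", " ", data)
        (self.link_text if self.in_link else self.out).append(text)


def html_to_text(html):
    parser = HtmlToRstText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()
```

    With this sketch, html_to_text("<ul><li>Line 1</li><li>Line 2</li></ul>") keeps the "- " bullets, and a link keeps its URL in the `text <url>`_ form instead of being reduced to the link text alone.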

  4. myDBR Team, Key Master

    Duane,
    the libraries above are meant for converting reStructuredText documents to HTML, not the other way around.

    myDBR uses PHPMailer, which basically just strips the HTML out to produce the plain text version of the mail. To have the plain text version contain an RST version of the HTML, one could use the Swiss Army knife Pandoc. We can take a look at whether we could add support for Pandoc in mail, so that the plain text version of the mail contains RST (or any other text-based format).
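
    For illustration only, converting an HTML body to RST with Pandoc from a script could look roughly like the Python sketch below. It assumes the pandoc executable is installed and on PATH, and it relies only on Pandoc's --from/--to options and its reading from standard input when no file is given. The function names are hypothetical, and this is not how myDBR itself integrates Pandoc.

```python
import shutil
import subprocess


def pandoc_command(target_format="rst"):
    """Build the pandoc argument list for HTML -> target_format."""
    return ["pandoc", "--from", "html", "--to", target_format]


def html_to_plain_part(html, target_format="rst"):
    """Pipe an HTML string through pandoc and return the converted text."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        pandoc_command(target_format),
        input=html,            # pandoc reads stdin when no file is given
        capture_output=True,
        text=True,
        check=True,            # raise if pandoc exits with an error
    )
    return result.stdout
```

    Passing target_format="plain" or "markdown" instead of "rst" would select another of Pandoc's text-based output formats.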

    --
    myDBR Team

  5. duane, Member

    Pandoc sounds promising. Thanks for the update...I'll keep trying my workaround in the meantime!

  6. myDBR Team, Key Master

    Duane,
    you can now postprocess the mail being sent. The default postprocessor is Pandoc. See the command dbr.mail.postprocess in the documentation for more info.

    --
    myDBR Team

