Extra <p> tags when compiling: MultiMarkdown -> Pandoc

ptmkenny · May 31, 2018, 4:41pm

I seem to be getting extra p tags when compiling and I’m not sure why.

I gave the text a style that is set to “treat as raw markup.” This is my document in Scrivener:

[code]COW This is a test and it should be inside a div.
There should be no P tag surrounding this. END

COW This is a test and it should be inside a div.
There should be no P tag surrounding this.
COW2 There should be no P tag surrounding this.
COW2 There should be no P tag surrounding this, either. END

But this should be inside a p tag.[/code]

The replacement patterns look like this:

COW → <div class="speechbubble chat-cow"><p class="speechicon">![cow](1f42e.svg)</p>
COW2 → </div><div class="speechbubble chat-cow"><p class="speechicon">![cow](1f42e.svg)</p>
END → </div>

I am compiling to MultiMarkdown using Pandoc processing. This is the result:

[code]

cow

This is a test and it should be inside a div. There should be no P tag surrounding this.

cow

This is a test and it should be inside a div. There should be no P tag surrounding this.

cow

There should be no P tag surrounding this.

cow

There should be no P tag surrounding this, either.

But this should be inside a p tag.

[/code]

I expected to see this (no extra p tags):

[code]

cow

This is a test and it should be inside a div. There should be no P tag surrounding this.

cow

This is a test and it should be inside a div. There should be no P tag surrounding this.

cow

There should be no P tag surrounding this.

cow

There should be no P tag surrounding this, either.

But this should be inside a p tag.

[/code]

I expected to see this because Markdown is (AFAIK) supposed to treat single carriage returns as part of the same paragraph, but here it is treating each single carriage return as its own paragraph. I’m attaching a sample project file if it’s easier to look directly at the code.
AnimalReplacements.scriv.zip (81.8 KB)

KB · May 31, 2018, 4:54pm

I recommend ticking “Save source files in a folder with exported file”, so that you can see the actual text file that is being passed to Pandoc. If you do this, you will see that your text contains lots of extraneous tags, which is presumably causing the problem. So it looks as though you may need to revisit the replacement patterns (or someone else, such as Ioa, may come along who is more of a RegEx expert than I am).

All the best,
Keith

ptmkenny · May 31, 2018, 5:56pm

Thanks Keith. Those broken tags were actually the result of another page that I was also testing in an attempt to debug what was actually going on.

Removing the page has no effect; the extra <p> tags are still there.

Now, though, I see that pandoc treats groups of lines with HTML in them differently than lines of pure text.

For example, the <p> tags are inserted incorrectly for this:

[code]

cow

This is a test and it should be inside a div.
There should be no P tag surrounding this.

![cow](1f42e.svg)

There should be no P tag surrounding this.

![cow](1f42e.svg)

There should be no P tag surrounding this, either.

[/code]

But inserted correctly for this (I used NOREPLACE to prevent replacement):

NOREPLACE This is a test and it should be inside a div. There should be no P tag surrounding this. NOREPLACE There should be no P tag surrounding this. NOREPLACE There should be no P tag surrounding this, either. NOREPLACE

So in that case, it looks like the problem lies in how Pandoc handles this input. The easiest fix, as far as I can see, is to write a script that runs through the document looking for consecutive lines of text separated by a single carriage return and replaces that carriage return with a space. I’ll try digging into this later and report back what I find.
AnimalReplacements2.scriv.zip (82.7 KB)
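
A minimal Python sketch of the script idea described above, assuming the compiled Markdown is available as a string. The function name join_soft_breaks is hypothetical and not part of Scrivener or Pandoc; it turns each single line break into a space and leaves blank-line paragraph breaks alone.

```python
import re


def join_soft_breaks(text: str) -> str:
    """Replace single line breaks with a space; keep blank-line paragraph breaks.

    Hypothetical helper for illustration only.
    """
    # Normalise Windows and old-Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline that is neither preceded by a newline nor followed by one
    # (and is not the final character) sits between two lines of text.
    return re.sub(r"(?<!\n)\n(?=[^\n])", " ", text)


if __name__ == "__main__":
    sample = "This is a test.\nThere should be no P tag.\n\nNew paragraph.\n"
    print(join_soft_breaks(sample))
```

Note that this joins every pair of adjacent lines, including lines inside code blocks or lists, so it would need narrowing before being used on a real manuscript.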

nontroppo · May 31, 2018, 8:07pm

HTML blocks are treated specially in Markdown. The Pandoc manual says this:

pandoc.org/MANUAL.html#raw-html

As I understand it, you need to have balanced tags, and in the case where you see extra <p> tags it is because the closing tag is not on the right line?

CommonMark goes into excruciating detail about what constitutes an HTML block. I’m not sure this is exactly what Pandoc does, as it uses its own parser:

spec.commonmark.org/0.28/#html-blocks

There is also the {=html5} method to “label” the HTML chunks specifically (pandoc.org/MANUAL.html#generic-raw-attribute), BUT then you can’t process your markdown embedded within (i.e. your images won’t expand properly).

ptmkenny · June 3, 2018, 5:18pm

@Nontroppo Thanks, that’s a very clear explanation. My problem was that the tag was not “balanced” as you stated because it was on the next line down.

I solved this by running a Perl script on the output that automatically joins any line that starts with </ onto the previous line.

I’m attaching the script in case it’s useful to anyone else.
remove-r.pl.zip (970 Bytes)
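
The attached Perl script is not reproduced in this thread. As a rough Python sketch of the same idea, the function below appends any line that begins with </ to the line before it. The name join_closing_tag_lines is made up for illustration, and leaving the line alone when the previous line is blank is an assumption, not necessarily what the attached script does.

```python
def join_closing_tag_lines(text: str) -> str:
    """Append each line that starts with "</" to the previous non-blank line.

    Hypothetical helper for illustration; not the attached remove-r.pl.
    """
    out = []
    for line in text.splitlines():
        starts_with_close = line.lstrip().startswith("</")
        # Assumption: only join when there is a previous line with content.
        if starts_with_close and out and out[-1].strip():
            out[-1] = out[-1].rstrip() + line.lstrip()
        else:
            out.append(line)
    joined = "\n".join(out)
    # splitlines() drops the final newline, so restore it if it was there.
    if text.endswith("\n"):
        joined += "\n"
    return joined


if __name__ == "__main__":
    sample = (
        "There should be no P tag surrounding this.\n"
        '</div><div class="speechbubble chat-cow">Next bubble.\n'
        "</div>\n"
    )
    print(join_closing_tag_lines(sample))
```

With the closing </div> pulled up onto the same line as the text it ends, the opening and closing tags of each block are no longer split across lines, which is the “balanced” condition discussed above.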

nontroppo · June 4, 2018, 1:56am

If you use pandocomatic to run pandoc, it can automate the running of the perl and other scripts for you… pandocomatic has the notion of running [setup], [preprocessor], [postprocessor] and [cleanup] scripts. [setup/cleanup] are useful for running general scripts to move files about. [pre/postprocessor] works on the document source only, to e.g. automate regex replacements of words etc. (this is in addition to Pandoc filters, which work directly on the AST representation of the document).