Cross-referencing MMD files

drd · March 21, 2011, 3:46am

Dear AmberV, aka Ioa,

I’m resurrecting a specific point from an old thread, so I thought it might be best to start fresh. (My comments are directed at AmberV, though others here might wish to contribute, too.)

First, let me express my gratitude. I have recently completed a major writing project, and I’m starting into another but with the goal of having a better handle on my electronic notes from day one. Your long history of threads here on MMD and the archiving of notes has proved enormously helpful. I’ve been studying your file naming system (+ meta-data) carefully for a while, returning to key posts as I refine and adapt the system for my own use.

One question I have stems from a comment that dates to 2007, where you explain a system of cross-referencing. I have a few questions and would appreciate clarification.

… I always put the target reference in a top level header at the beginning of the document. Generally I create a stylesheet with:
body h1:first-child { display: none; }
As a rule which hides this first header from display in a browser. You don’t need brackets in the name. The use of a simple [bracketLink][] in MMD is just a shorthand method for [bracketLink][bracketLink]. They both do the same thing. The idea is, if you are not going to be needing to alias the link with another phrase, you might as well reduce complexity.

So [ic07193687][] will create a link to a spot in the merged file that corresponds with:
# ic07193687 #
Even if the stylesheet is currently hiding the visible component from display, it will still work.

I put the two letter identification bit in front because the XHTML spec requires a non-numerical character in the first position of an id. Otherwise, I’d just use the number. I suppose I could just use a generic two letter thing like, ‘id07193687’, but I figured why not indicate if the linked item is going to be a short paragraph or two, or an entire article.

Another alternative, if one wanted to create a network of thoughts instead of merged pages, would be to create a link to where the other file would be. If they are all going into the same web directory, [Index Card on Bamboo Fibers](./07194756-R-BambooFibers.html) would do the trick. In that case, you wouldn’t need to create a target link in the destination file at all.

Here’s some of what I’m wondering.

Is this hidden top-level header still a core part of your file processing system? Reading all your threads in a close space of time I do detect some changes over the years. Any improvements on this little technique?
I don’t know how much you use LaTeX, but is it just as easy to hide top level headers as it is with CSS? Probably there’s a way to do this with LaTeX (there usually is). Is it something you’ve pursued?
The last suggestion really intrigues me. It seems that you have worked out a way to create links between files in your INDEX files on your computer and not just across a web directory. It would be very useful to create an index card that did not need to be merged with other mmd files to have active links.

On that last point, actually, I gather that you have worked something out. Part of a 2009 comment shows you are doing some kind of black magic.

Each entry in the system has its own date and time and thus a unique way to link to it. Rather than provide the precise URI to the file in MMD linking syntax, I have a shortcut that just requires the unique ID, <|unique_id|>. My MultiMarkdown parser has been modified to intercept these codes and expand them out to full MMD syntax in combination with a file search routine to pinpoint the precise URI for me. Thus when I render the file to XHTML or whatever, it gets a link to the actual file on the system. Cross-referencing is mindless and flexible. If I change where I archive everything on the filesystem, I just adjust a variable in the script that adjusts the URIs. The base documents themselves are therefore ignorant of file-system specific information. This is a big problem with many of these applications in that their linking ability is absolute URI based. Move a file, and they get confused. Move everything, and your cross-references are useless.

Astonishingly clever. How exactly is the MMD parser modified to do this? I would love to get to the point where “Cross-referencing is mindless and flexible.”

Feel free to assume that I’ve got a system of file nomenclature and metadata, filed away into archival directories by years and quarters, that in many respects resembles what you have described elsewhere in this forum. Eg, in my files the top of this file looks roughly like this:

[code]Title: Cross-referencing MMD files
…
Date: 2011-03-20 23:53:25
GUID: 110320.2353
Keywords: {C2.1.mmd}

id110320.2353

Dear …[/code]
Cheers!

AmberV · March 22, 2011, 12:44am

Thanks! I’m glad to hear my ramblings have been of some use. Yeah, things evolve over the years. I never do sit still on any one technique, whether that is good or bad I don’t know.

On the first point, I no longer hide the first header. This always kind of bothered me anyway because it means data in the content area that is really honestly meta-data, not data. The document ID now goes into the Keywords field in all of my files.

I’ve also switched, as you noted, from relying on simple file merging to handle the problem where lots of smaller notes are meant to be part of one larger summary; mainly because I don’t physically merge things any more. I don’t find this terribly useful given how I record information, and prefer a “personal web” of inter-linked documents.

To get this into the MMD system, I decided to write my own script, and then adjust the helper scripts to add my script into the pipe that processes files, rather than try to add these features to the Perl scripts themselves. The easiest place to add this is in the Support.pm file. I added a variable at the top ($custom_parsing), accessible to all of the sub-routines, with [b]parse_links.rb |[/b] as its data, and then inserted that variable wherever I saw a line like:

open (MultiMarkdown, "| cd \"$MMDPath\"; bin/MultiMarkdown.pl | bin/$SmartyPants $xslt $out");

Altering it to:

open (MultiMarkdown, "| cd \"$MMDPath\"; $custom_parsing bin/MultiMarkdown.pl | bin/$SmartyPants $xslt $out");

The key is to get it in before the main MultiMarkdown.pl script runs, as its job is to turn my shorthand into full MMD syntax.

Yes, in fact the above CSS snippet would do that with a little alteration. The first part of that is the “selector”. To hide all h1 elements indiscriminately, you would type this into the CSS file instead:

h1 { display:none; }

If you wanted to hide several levels:

h1, h2 { display:none; }

For the script itself, I have slimmed it down a bit. There were a lot of other things in it that are just personal additions. Note there is also support for a [b]<^filename^>[/b] notation, which just tacks on the root archive folder path. This is useful in the CSS field, for instance.

CSS: <^/machinery/css/article.css^>

Will expand out to a full file URL, which will cause the resulting HTML file to be styled according to the CSS stylesheet—and again preserve the advantage of not hard-linking to anything. So long as I keep the ‘machinery’ folder within the root archive folder, I’ll have friendly CSS displays when I view my files.

What I still have yet to do is find a way of MMD processing something when you click on the link. Of course, in any browser if you are viewing the HTML and click on a cross-reference, it takes you to the actual .md file, something the browser is not capable of handling—so it comes out as plain text. I toyed with the idea of “pre-rendering” anything a document links to and stashing them in a cache folder as HTML and then linking to that, but this could result in massive render times for larger documents with lots of links to other large documents—and how far should the chain go, what about the documents they link to?

Fortunately that is just a mild irritation. If I really need to see the linked document in HTML instead of MMD, I can just render that one too.

So to install the script, copy and paste the below into a file and save that into your MMD/bin folder. After doing that, change its executable bit to on, and edit the file’s configuration section to match your setup.

The script assumes you have a text file called [b]invalid_id.txt[/b] in the root archive. This is nothing fancy; just a text file that says the link you clicked on could not be associated with any file ID. So if you typo the ID, you’ll know what’s wrong. If you want to put this file some place else, just change the $error_file line in the configuration section to reflect that.

You can test the script prior to fully integrating it, by doing something like this on the command line:

cat input_file.md | ruby parse_links.rb > output_file.md

That assumes both the script and test file are in the same spot, of course. The product should be a valid MMD file with no custom mark-up. This is what gets invisibly passed to the full MultiMarkdown.pl script, once integrated.

Detailed technical notes on the actual syntax usage are in the comments at the top of the script.

One other thing of note: it doesn’t ignore verbatim or code blocks. It wouldn’t be too difficult to add that, but I’ve never had a need for putting cross-reference syntax into a code span.
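A minimal sketch of how that exclusion could be added, under two simplifying assumptions: verbatim blocks are any lines indented by a tab or four spaces, and code spans are delimited by single backticks. `expand_outside_code` is a hypothetical helper that is not part of the posted script, and the `marker` lambda only stands in for the real link expansion.

```ruby
# Hypothetical helper: apply a transformation to a line while leaving
# Markdown code untouched.
# Assumptions: verbatim blocks are lines indented by a tab or 4+ spaces,
# and code spans are delimited by single backticks.
def expand_outside_code(line)
  # Indented (verbatim) lines pass through unchanged.
  return line if line =~ /\A(?:\t| {4})/

  # Split on code spans; the capture group keeps the spans in the result.
  line.split(/(`[^`]*`)/).map { |chunk|
    chunk.start_with?('`') ? chunk : yield(chunk)
  }.join
end

# Stand-in for the real expansion: just mark the bare ID tokens.
marker = ->(text) { text.gsub(/<\|(\D{0,2}\d{8}).*?\|>/) { "[#{$1}]" } }

puts expand_outside_code("See <|id11081956|> but not `<|id11081956|>`.") { |t| marker.call(t) }
puts expand_outside_code("    <|id11081956|> in a verbatim block") { |t| marker.call(t) }
```

In the script itself, the body of the `case` statement would take the place of `marker`. Treating every indented line as verbatim is cruder than real Markdown, which only does so after a blank line, so nested list items would also be skipped.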

[code]#!/usr/bin/ruby
require 'find'

# START CONFIGURATION
# -----------------------------------------------------------------

home_dir = ENV['HOME'] ? ENV['HOME'] : '/Users/USERNAME'
$pre = 'file://localhost'
$icfsBase = home_dir + '/PATH/TO/ARCHIVE_ROOT/FILES'
$arkBase = home_dir + '/PATH/TO/ARCHIVE_ROOT'
$error_file = $arkBase + '/invalid_id.txt'

# -----------------------------------------------------------------
# END CONFIGURATION

# Pre-load full ICFS list
icfsList = []
Find.find($icfsBase) { |f| icfsList.push(f) }

# Return the first path with a component that starts with the given ID
# (whitespace escaped as %20), or the error file if nothing matches.
def findICFS(strid, fileList)
  res = fileList.find { |f| f =~ /\/#{strid}/ }
  if res
    return res.gsub(/\s/, '%20')
  else
    return $error_file
  end
end

# Regular expressions
# <^path^>                         raw file URL
fileLink = /<\^([^\^]+)\^>/
# [text](<|id12345678|>)           ID inside an inline MMD link
embeddedID = /(\[[^\]]+\])\(<\|\D{0,2}(\d{8}).*?\|>\)/
# <|id12345678 "optional note"|>   bare ID
bareID = /<\|(\D{0,2})(\d{8}).*?\|>/
# [label]: <|id12345678|>          reference-style link definition
referredLink = /^(\[[^\]]+\]:\s)<\|\D{0,2}(\d{8}).*?\|>/

#inpt = File.open('test_data.txt') { |fn| fn.readlines }
inpt = $stdin

process_input = inpt.readlines

process_input.each do |line|
  case line
  when referredLink
    line.sub!(referredLink, $1 + $pre + findICFS($2, icfsList))
    redo
  when fileLink
    # Fix file links
    line.sub!(fileLink, $pre + $arkBase + $1)
    redo
  when embeddedID
    # Find link embedded in MMD link
    line.sub!(embeddedID, '\1(' + $pre + findICFS($2, icfsList) + ')')
    redo
  when bareID
    if $1 != ''
      ind = $1
    else
      ind = 'id'
    end
    line.sub!(bareID, "[#{ind}" + '\2](' + $pre + findICFS($2, icfsList) + ')')
    redo
  else
    print line
    next
  end
end[/code]

drd · March 22, 2011, 5:12pm

Thanks much. A few more questions of clarification.

First, where should I look for the Support.pm file? After a terminal search I see rather too many options, including in Scrivener and Perl itself.

/Applications/Scrivener.app/Contents/MacOS/MultiMarkdown/bin/MultiMarkdown/Support.pm
/Applications/Scrivener.app/Contents/Resources/MultiMarkdown/bin/MultiMarkdown/Support.pm
/Library/Perl/5.10.0/darwin-thread-multi-2level/NetSNMP/agent/Support.pm
/System/Library/Perl/Extras/5.10.0/XML/NamespaceSupport.pm
/System/Library/Perl/Extras/5.8.9/XML/NamespaceSupport.pm
/System/Library/Perl/Extras/5.10.0/Regexp/Common/_support.pm
/System/Library/Perl/Extras/5.8.9/Regexp/Common/_support.pm

Since I’m in the habit of using Fletcher’s MMD to override whatever version comes with Scrivener, is there a place to make these customizations independent of any updates that may occur in Scrivener? From what I can tell the Support.pm file is in MMD2, but not yet in the MMD3.0b10 distribution I’m running.

(UPDATE: I’ve asked about the future of the Support.pm file on the MMD discussion list: http://groups.google.com/group/multimarkdown/browse_thread/thread/568e9bd7390859e9 As FP explains, “MMD is now a single binary, rather than a collection of various perl scripts.” The Support.pm file is therefore not a part of MMD3.)

Second, do you have different formats for the different cross-reference types, or is it <^filename-or-path^> for everything? It would help to see what a typical keyword entry in your meta-data currently looks like.

On a slightly different note, are there scripting or automation benefits to using a 365 day count in your unique IDs instead of the MonthDay combo (which follows the UTC format beneath Unix time)? And is there any reason not to include a . or ’ or something between the date and time figures, which functions to make a machine-readable thing more recognizable to human eyes?

Again, thanks. When a senior colleague told me that for a decade he had worked in nothing but text files, I was dumbfounded. Impossible! Now, with an accumulation of project files tucked away in more than one proprietary format that I now prefer not to (or even cannot) use, I’m beginning to think there’s no other reasonable way.

AmberV · March 22, 2011, 10:52pm

Sorry, I should have clarified I was referring to the current stable version. I have not made an attempt to put MMD3 into my full workflow yet, but theoretically it should be easier. With the new version you just call multimarkdown and that’s it. So all you need to do is parse_links.rb | multimarkdown. The Support.pm file that I referred to is in ~/Library/Application Support/MultiMarkdown/bin/MultiMarkdown, so one level below the mmd2XHTML.pl type scripts.

Here are the cross reference formats, in an example before and after:

<|id11081956|>
<|id11081956 "Note to yourself about what the link is"|>
[id11081956](File:///Users/uname/Archive/2011/11090/11081956-I-Example File.md)

The second one demonstrates an annotated link; the text will be completely dropped from the resulting MMD file.

[Plain English](<|id11081956|>)
[Plain English](File:///Users/uname/Archive/2011/11090/11081956-I-Example File.md)

<^machinery/css/articles.css^>
File:///Users/uname/Archive/machinery/css/articles.css

One of the things I wanted to accomplish with the syntax was an increase in readability. To me, these are much easier to look at in a file than full URLs and links.

So use the last one when you need a raw file URL. The other whenever you don’t care to know the full path or even name of a file. If you also rename your collected graphics and media with an ICFS name, then you can insert that token into the URL side of an image link, too.

To be clear, the keyword is not actually used for this particular function. That’s there for other tools. If I have the above ID number then I can search for it using Spotlight and other similar tools. Because of the way I use this number as both an ID for the document and a method of referring to the document, such a search result returns the entire cloud of things. The original item itself, and everything that refers to it. This is enormously useful for long time-span idea formation. If you think of something you’d like to add to an idea, your mind might go to one of the most recent related ideas, and in doing so you’ll find a back-trace to an earlier thought “SeeAlso” etc. So one search can pull up not only the original thought, but all of the separate tangents that hark back to it as well.

A typical keyword entry might look like:

Keywords: {I1.1.Theory} id11081965

That’s it. I don’t often use any further keywords, as generally everything pertinent to the file has been already stated in the content or title, but if something remained unsaid that it is clearly about, I’ll toss in some helpers. That’s what I like about working in the content area for this kind of stuff. Since it’s all in one field, you don’t have to put redundant information anywhere. In a database that had a “Keywords” field, you’d have to put a bunch of redundant words in there because a keyword search would be ignorant of any genuinely useful words in the content area. Just searching the content area reduces the need for that, and if I really do need to isolate by keyword (say I’m looking for unsent e-mail messages, which have “Unsent” in the Keywords line, but I don’t want just every instance of that word returned), then I toss “Keywords:” into the search phrase to make sure unsent is on that line. How to do that precisely depends on the software.

For scripting, I can’t think of any substantial benefit. For head math, sure. The main reason I use the datestamp that I use is because that is the datestamp that I use. Same reason most people use months and days—that’s just how I think of time. The calendar on my wall says today is the 81st. The only times I use months and days are when I’m filling in the date on a form or something.

I actually do that already. The datestamp that I use looks like: 11081:970. The colon means today is Tuesday. So not only is it very readable, it is very informative. Each day of the week has its own punctuation mark. That can be freeform because it is trivial to ignore or drop the 4th character from the right in a script. I don’t use it in filenames though because I’m kind of old school about not using too much punctuation in filenames for some reason. Probably all those years of UNIX. I equate using spaces and parentheses and so on with doubling how much I have to type in on the command line, escaping every other thing. For a while I played with 11081-973, as a filename, but in the end decided to compact it by one character and drop the punctuation so that I could keep filename columns more compact, and double-click would work the same everywhere (some consider a hyphen to be a word break). <|id11081973|> can be double-clicked and pasted into Spotlight in a jiffy. So it’s not only a useful machine tool, it’s a useful human tool, without a punctuation mark.
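As a sketch of that step: assuming the punctuated form always carries exactly one weekday mark, fourth from the right, a script can recover the bare eight-digit ID as below. `bare_stamp` is a hypothetical helper name. Only the colon (Tuesday) is confirmed above; the hyphen in the second call is borrowed from the filename experiment.

```ruby
# Hypothetical helper: reduce a punctuated datestamp such as "11081:970"
# to the bare eight-digit form used in filenames and <|id...|> tokens.
def bare_stamp(stamp)
  # The weekday mark is the 4th character from the right; keep the rest.
  compact = stamp[0...-4] + stamp[-3..-1]
  raise ArgumentError, "unexpected stamp: #{stamp}" unless compact =~ /\A\d{8}\z/
  compact
end

puts bare_stamp("11081:970")   # => 11081970
puts bare_stamp("11081-973")   # => 11081973
```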

drd · March 23, 2011, 9:10pm

Right. The script relies on having the date string in the title. Point taken about the human use of a stamp that can be double clicked. What about the last three numbers? Milliseconds for a unique ID on any given day?

And icfs = index card filing system?

Finally, I see what you mean about HTML links taking you directly to the .md files. This introduces a slight hiccough in the experience of surfing a personal web of knowledge. One also has to have the files rendered into some format to benefit. Do you tend to preview the files as HTML in TextMate to take advantage of live links?

All in all, very elegant. Honestly, I feel I ought to send you flowers. Thank you for sharing.

AmberV · March 23, 2011, 9:36pm

My pleasure!

Yeah, and mainly for performance reasons. Searching title names, even thousands of them, can be done in a jiffy and the script pre-caches the name list so that your disk is only trawled once per file. Searching for a keyword inside the file would be more difficult—not only because older entries do not use the Keyword system (and the search needs to find one file, so just searching for any instance would produce multiple hits in many cases), or even MMD which didn’t exist when they were written, but because searching file contents in a large context takes forever without an index like Spotlight or Scrivener does.
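A sketch of the same pre-caching idea taken one step further, with the archive walked once and each file indexed by its ID so that later lookups are hash fetches. It assumes every archived filename begins with its eight-digit ID, as in `11081956-I-Example File.md`. `build_index` is a hypothetical name; the posted script keeps a plain array and scans it on every lookup.

```ruby
require 'find'
require 'tmpdir'

# Hypothetical variant of the lookup: walk the archive once and index
# every file by the eight digits that start its name.
def build_index(root)
  index = {}
  Find.find(root) do |path|
    next unless File.file?(path)
    id = File.basename(path)[/\A\d{8}/]
    index[id] ||= path if id          # keep the first file seen for each ID
  end
  index
end

# Demonstration against a throwaway directory.
Dir.mktmpdir do |root|
  File.write(File.join(root, '11081956-I-Example File.md'), "Title: Example\n")
  index = build_index(root)
  puts index.fetch('11081956', 'invalid_id.txt')
  puts index.fetch('99999999', 'invalid_id.txt')   # unknown ID falls back
end
```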

Plus, it was way easier to make a series of scripts that renamed everything to ICFS standard than to conform their contents to the modern system. That is something I pick at when I have spare time, but it’s taken many years. I just have too much crap.

Last three digits are just the day expressed to a fractional resolution of one thousandth; it’s mostly used in astronomy. One thousandth of a day is equal to 86.4 seconds, so just a bit over a minute. I experimented with millionths for a while, which is closer to a single second, but never really found that I needed that level of resolution. This might seem incredibly impossible to work with, but it’s actually not too bad. Swatch tried to get people using this exact same system so you can actually buy wristwatches that display time in 1000ths of a day, not to mention menu clock replacements and other software utilities.
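To make the arithmetic concrete, here is a sketch that builds an eight-digit stamp from a Ruby `Time`: two-digit year, three-digit day of the year, then thousandths of the day. The method name `pds_stamp`, the use of local time, and truncating rather than rounding are all assumptions, not details taken from the posts above.

```ruby
# Sketch: build an eight-digit stamp (YY + day of year + thousandths of a day).
# Assumptions: local time is used and the fraction is truncated.
def pds_stamp(time)
  seconds_into_day = time.hour * 3600 + time.min * 60 + time.sec
  thousandths = (seconds_into_day * 1000) / 86_400   # integer division truncates
  format('%02d%03d%03d', time.year % 100, time.yday, thousandths)
end

# 22 March 2011 is day 81 of the year; 23:16:48 is 0.970 of the way through it.
puts pds_stamp(Time.new(2011, 3, 22, 23, 16, 48))   # => 11081970
```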

Most things take longer than 90 seconds to do, so by the time you need another ID, it’s already incremented. The only time this really doesn’t work is when importing a group of files together—typically pictures for a file. In that case I usually just pick a nearby blank spot large enough to accommodate the number of IDs I need, and use a tool like A Better Finder Renamer to bulk rename them with the ICFS number +1. But hey, that’s going to be a problem with any date stamp based system, and at least with fractions, you don’t have to worry about base 60 math.

ICFS = that, yes. Sorry, a little internal jargon there. ICFS is anything pertaining to the organisation and naming of files. PDS (Personal Decimal System; or maybe Petra’ka Decimal System, if I’m feeling heroic) is the definition of the token itself.

These days I’m back to my old editor for most things, Vim, so I have that scripted to run the editing buffer through MMD, save it to a file, and prompt my browser to open it—essentially no different, live links still work (with the same limitation), except that it addresses an annoying thing about TextMate’s preview in that it doesn’t handle footnotes or any text anchors for that matter, for this purpose, because it reloads from the file not the rendered temporary file—so you drop back to raw MMD. But TM makes a very nice platform for all of this. Honestly I’m not sure if the Vim thing is just a “phase”.

Hmm, I wonder if some of the more extensible browsers could have an M/MD plug-in that automatically renders stuff with one of the common MMD extensions. That’s a thought.

drd · April 4, 2011, 12:23pm

Hello again,

I’ve been looking into ways of creating a browser-navigable but local .html version of a .md archive, such as the one described above. Specifically, I’ve been looking at two kits (though I know there are a number of near-equivalents in the systems behind a number of .md based “baked” sites). One is Dr. Drang’s no-server personal wiki (here: github.com/drdrang/notes ). The other is Fletcher Penney’s own MMD-CMS.

In both cases, I run into trouble with your script, Ioa, when it stands at the front of a command pipeline. The technique of both systems is to generate a set of .html files parallel to the source .md files. For example, the following bits of a shell script work as written:

multimarkdown "$file_name" | xsltproc -nonet -novalid $xslt_path/XSLT/$mode -

multimarkdown "$1" | xsltproc -nonet -novalid $xslt_path/XSLT/$mode - > "$file_name.html"

But then they stall out on the first of a batch of files when I modify them as follows, with parse_links.rb put alongside where the multimarkdown binary command lives in MMD 3:

parse_links.rb "$file_name" | multimarkdown | xsltproc -nonet -novalid $xslt_path/XSLT/$mode -

parse_links.rb "$1" | multimarkdown | xsltproc -nonet -novalid $xslt_path/XSLT/$mode - > "$file_name.html"

Alternately, in a python script, this works just fine:

cmd = 'MultiMarkdown %s | SmartyPants.pl' % mdFile

but this pipeline hangs on the first file:

cmd = 'parse_links.rb %s | MultiMarkdown | SmartyPants.pl' % mdFile

Any ideas about what I’m missing?

DivineDominion · April 18, 2011, 8:07pm

Aww, shoot – so you’re saying all your INDEX documents are not affected by your usual file naming conventions? I (try to) duplicate files upon changing them, just like Boswell does as you stated somewhere else. Having this kind of versioning sounds potentially useful. It’s just that I don’t fancy updating every referencing INDEX file, too, upon every change to the files it references. Removing INDEX files from this file naming (and atomic versioning) paradigm could result in me feeling less friction when I want to update.

I think piping processed MMD files through a script is much more scalable than to keep on par with Fletcher Penney’s changes. Using MMD 3.0 is way faster, though.

( I’m involved in Notational Velocity customization and try to incorporate some of your best practices – christiantietze.tumblr.com )

@drd:

Could you specify where the ruby script fails? It could “stall” for a lot of reasons – try to “print” some test messages at strategically important parts of the script, like so:

[code]…

print "configuration works\n" # <--------------------

#inpt = File.open('test_data.txt') { |fn| fn.readlines }
inpt = $stdin

process_input = inpt.readlines

print "lines read\n" # <--------------------

process_input.each do |line|
  case line
  when referredLink
    print "referred\n" # <--------------------
    line.sub!(referredLink, $1 + $pre + findICFS($2, icfsList))
    redo
  when fileLink
    print "file\n" # <--------------------
    # Fix file links
    line.sub!(fileLink, $pre + $arkBase + $1)
    redo
  when embeddedID
    print "embedded\n" # <--------------------
    # Find link embedded in MMD link
    line.sub!(embeddedID, '\1(' + $pre + findICFS($2, icfsList) + ')')
    redo
  when bareID
    print "bare\n" # <--------------------
    if $1 != ''
      ind = $1
    else
      ind = 'id'
    end
    line.sub!(bareID, "[#{ind}" + '\2](' + $pre + findICFS($2, icfsList) + ')')
    redo
  else
    print "else?\n" # <--------------------
    print line
    next
  end
end[/code]

This, of course, may render your output file unusable. But you can track the point at which the script fails when applied to various test files by comparing what you expect to what is actually printed into the file. First, try to spot some regularity in how an input file must be written to make the script fail.
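Going only by the script as posted, one likely cause of the stall is that it reads `$stdin` and never looks at its command-line arguments. In `parse_links.rb "$file_name" | multimarkdown` nothing is piped in, so the script waits for keyboard input. Redirecting the file in (`parse_links.rb < "$file_name" | multimarkdown`) should avoid that. Alternatively the script could read through `ARGF`, which takes the files named on the command line and falls back to standard input when there are none. A minimal sketch of the second option follows; `read_input` is a hypothetical name, and the trace goes to standard error so that it stays out of the generated file.

```ruby
require 'stringio'

# Hypothetical replacement for the script's `inpt = $stdin` line.
# ARGF reads the files named on the command line, or standard input
# when no file names are given.
def read_input(source = ARGF)
  lines = source.readlines
  # warn writes to standard error, so the trace never reaches multimarkdown.
  warn "parse_links: read #{lines.length} lines"
  lines
end

# Demonstration with an in-memory stream; inside the script itself,
# call read_input with no argument.
read_input(StringIO.new("one\ntwo\n")).each { |line| print line }
```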