Reimplementing Markdown in Arc Lisp

Markdown is a double-edged sword in online commenting. It is so ubiquitous on popular community sites that users expect it, yet sites differ in which parts of the spec they support and in how they implement them. At hubski we use a small subset of the markdown spec with our own customizations (such as shout-outs). Normally I would use an appropriate library for parsing markdown and be done with it, but unfortunately no one has written one for Arc that I know of.

Arc's Markdown

What we do use is a modified version of news.arc, whose very simple implementation only allows for italics, code blocks, and clickable links. It goes character by character, checks for an asterisk or for a double space at the start of a new line, and then looks ahead for an applicable end character. After it has converted the text, it saves the resulting html string in the appropriate file for the comment or submission. If you need to edit your text, it calls an unmarkdown function to convert it back. This works for pg, and he and Nick Sivo (kogir) don't seem to have any desire to extend it.

This is the function which converts markdown strings to html:

(def markdown (s (o maxurl) (o nolinks))
  (let ital nil
    (tostring
      (forlen i s
        (iflet (newi spaces) (indented-code s i (if (is i 0) 2 0))
          (do (pr  "<p><pre><code>")
            (let cb (code-block s (- newi spaces 1))
              (pr cb)
              (= i (+ (- newi spaces 1) (len cb))))
            (pr "</code></pre>"))
          (iflet newi (parabreak s i (if (is i 0) 1 0))
             (do (unless (is i 0) (pr "<p>"))
                 (= i (- newi 1)))
            (and (is (s i) #\*)
                 (or ital
                     (atend i s)
                     (and (~whitec (s (+ i 1)))
                          (pos #\* s (+ i 1)))))
             (do (pr (if ital "</i>" "<i>"))
                 (= ital (no ital)))
            (and (no nolinks)
                 (or (litmatch "http://" s i)
                     (litmatch "https://" s i)))
             (withs (n   (urlend s i)
                     url (clean-url (cut s i n)))
               (tag (a href url rel 'nofollow)
                 (pr (if (no maxurl) url (ellipsize url maxurl))))
               (= i (- n 1)))
             (writec (s i))))))))

Up until now we have hacked all of our current formatting onto that implementation. For things like bolding, quoting, and strikethroughs that was easy enough: look for a character (*, +, |, or ~), look ahead for its matching character, and replace as necessary. As we added more and more functionality, it got progressively trickier to keep things from breaking.
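To make the "look for a character, look ahead for its match, replace" idea concrete, here is a minimal sketch. It is written in Python rather than Arc, it is not our actual code, and the marker-to-tag table and the name simple_markdown are assumptions made up for the illustration.

```python
# Hypothetical marker-to-tag table; the real mapping on the site may differ,
# and quoting with | is left out to keep the sketch short.
TAGS = {"*": "i", "+": "b", "~": "s"}

def simple_markdown(s):
    """Replace marker pairs with html tags using a single look-ahead."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch in TAGS:
            close = s.find(ch, i + 1)  # look ahead for the matching marker
            if close != -1:
                tag = TAGS[ch]
                # The text between the markers is copied verbatim,
                # so any markdown inside it is never examined.
                out.append(f"<{tag}>{s[i + 1:close]}</{tag}>")
                i = close + 1
                continue
        out.append(ch)  # ordinary character, or a marker with no match
        i += 1
    return "".join(out)

print(simple_markdown("a +b+ c"))
print(simple_markdown("+a *b* c+"))
```

Note that the second call leaves the inner asterisks untouched, because everything between the outer markers is skipped over in one step.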

Problems

One obvious problem is that it doesn't account for nested markdown. For instance, when it sees a code block it prints the contents of the block wrapped in the appropriate tags and never checks for any more markdown within them.

Another, more subtle problem is that it stores only the html version of the string. What most websites do (from what I'm told) is store both the markdown and html versions. This way you only have to deal with a markdown-to-html parser and not the reverse. The benefits become obvious when you want unordered lists. For instance, according to the spec -, +, and * are all valid bullets for an unordered list. When converting back, which do you choose to display to the user?
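A small Python sketch of why the reverse conversion is lossy, and of storing both versions side by side. The names lists_to_html, StoredComment, and save are hypothetical, and the list handling is deliberately naive (every line is assumed to be a bullet).

```python
import re
from dataclasses import dataclass

def lists_to_html(md):
    """Naive conversion: treat every line as one unordered-list item."""
    items = [re.sub(r"^[-+*] ", "", line) for line in md.splitlines()]
    return "<ul>" + "".join(f"<li>{item}</li>" for item in items) + "</ul>"

@dataclass
class StoredComment:
    markdown: str  # exactly what the user typed, shown again when editing
    html: str      # rendered once, served on every page view

def save(md):
    return StoredComment(markdown=md, html=lists_to_html(md))

# Three different sources collapse to the same html, so the html alone
# cannot tell us which bullet character the user originally typed.
print(lists_to_html("- a\n- b") == lists_to_html("* a\n* b") == lists_to_html("+ a\n+ b"))
```

With the markdown kept next to the html, editing simply shows the stored source and no unmarkdown step is needed.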

Solution

So, in order to allow for nested markdown, I decided to rewrite how we deal with it. First we go through the input string and break it up into a list of tokens. Simple tokens are things like *, +, |, ~, \, [, ], (, ), spaces, and newlines. More complex ones include image and video links, which get embedded. Luckily, if we separate things by spaces we get links by themselves. For instance, we want to turn "This is +bolded+.\nHere's a embedded link: http://i.imgur.com/zAuxnTw.gif" into

("This" #\space "is" #\space #\+ "bolded" #\+ "." #\newline "Here's" #\space "a" #\space "embedded" #\space "link:" #\space "http://i.imgur.com/zAuxnTw.gif")
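A tokenizer along these lines could look roughly like the following. This is a Python sketch and not our Arc code; the name tokenize and the SPECIAL set are assumptions, and parens are deliberately left out because links need extra care.

```python
import re

# Single-character tokens. Parens are omitted on purpose in this sketch.
SPECIAL = set("*+|~\\[]")

def tokenize(s):
    tokens = []
    # Split on single whitespace characters and keep them as tokens.
    for chunk in re.split(r"(\s)", s):
        if chunk == "":
            continue
        # Whitespace and bare links are kept whole, so a url containing
        # + or ~ is not broken apart.
        if chunk.isspace() or chunk.startswith(("http://", "https://")):
            tokens.append(chunk)
            continue
        buf = ""
        for ch in chunk:
            if ch in SPECIAL:
                if buf:  # flush the text buffered so far
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens

print(tokenize("This is +bolded+.\nHere's a embedded link: http://i.imgur.com/zAuxnTw.gif"))
```

Splitting on whitespace first is what lets the link come out as one token before any marker characters inside it are considered.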

Links pose their own problem. We want to be able to split up links of the form [Text](url) while allowing the url to contain parens. The way I ended up doing it is to tokenize a ( only when it is preceded by a ], and to count the number of open parens in the currently buffered text to find the ) that actually ends the url.
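The paren counting can be sketched as follows. This Python version works directly on the string instead of on a token buffer, so it only illustrates the counting idea; split_link and its return shape are hypothetical.

```python
def split_link(s, start):
    """Parse [text](url) where s[start] is the opening bracket.

    Returns (text, url, next_index), or None if this is not a link.
    """
    close_bracket = s.find("]", start + 1)
    # A ( only counts as the start of a url when it directly follows a ].
    if close_bracket == -1 or s[close_bracket + 1:close_bracket + 2] != "(":
        return None
    depth = 0
    i = close_bracket + 1
    while i < len(s):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:  # this ) balances the ( that opened the url
                text = s[start + 1:close_bracket]
                url = s[close_bracket + 2:i]
                return text, url, i + 1
        i += 1
    return None  # the parens never balanced

s = "[Lisp](http://en.wikipedia.org/wiki/Lisp_(programming_language)) rocks"
print(split_link(s, 0))
```

Because the depth only returns to zero at the final paren, the parens inside the url stay part of the url.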

From here we can use this list in much the same way as the original implementation, but going token by token instead of character by character to build an html string. As we go through the list we can keep state either through tail recursion or through local variables in a loop (which is what the original uses). I ended up using both in different places to simplify things. This way we can signal that the text is in a bolded or italic state without skipping over anything.
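Here is a rough Python sketch of the loop-with-local-state variant. It is not our Arc code: render, the TAGS table, the stack of open tags, and the html escaping of plain text are all assumptions made for the illustration.

```python
import html

# Hypothetical marker-to-tag table.
TAGS = {"*": "i", "+": "b", "~": "s"}

def render(tokens):
    stack = []  # tags that are currently open, innermost last
    out = []
    for i, tok in enumerate(tokens):
        if tok in TAGS:
            tag = TAGS[tok]
            if stack and stack[-1] == tag:
                stack.pop()  # closes the innermost open tag
                out.append(f"</{tag}>")
            elif tag not in stack and tok in tokens[i + 1:]:
                stack.append(tag)  # open only if a closer exists later
                out.append(f"<{tag}>")
            else:
                out.append(tok)  # unmatched or mis-nested: keep it literal
        else:
            # Escaping plain text is an assumption of this sketch.
            out.append(html.escape(tok, quote=False))
    while stack:  # close anything left open at the end
        out.append(f"</{stack.pop()}>")
    return "".join(out)

tokens = ["This", " ", "is", " ", "+", "bold", " ", "*", "and", " ", "italic", "*", "+", "."]
print(render(tokens))
```

Since every token is visited in turn and the state lives in the stack, markdown nested inside other markdown is handled without any look-ahead replacement.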

Another thing we added was an escape character, \, in case you want to use an asterisk or two on their own.
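One way to handle the escape is a pass over the token list that tags each token as a marker or as plain text. This is a Python sketch under that assumption; apply_escapes and the (kind, value) pairs are hypothetical and not how our Arc code represents tokens.

```python
MARKERS = set("*+|~")

def apply_escapes(tokens):
    """Tag tokens as ("marker", t) or ("text", t), honoring backslashes."""
    out = []
    escaped = False
    for tok in tokens:
        if escaped:
            out.append(("text", tok))  # whatever follows a \ is literal
            escaped = False
        elif tok == "\\":
            escaped = True  # swallow the backslash itself
        elif tok in MARKERS:
            out.append(("marker", tok))
        else:
            out.append(("text", tok))
    if escaped:  # a trailing backslash with nothing after it
        out.append(("text", "\\"))
    return out

print(apply_escapes(["2", " ", "\\", "*", " ", "3"]))
```

A later rendering step would then only treat the tokens tagged as markers as formatting.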

Unfortunately much of our text data is still only stored in html string form. One day I will write a script to go through all our files and add the markdown versions of the text. We plan on moving our data to a proper database anyhow, so we'll probably end up doing that then as it will entail rewriting all our 'database' calls anyway.

Conclusion

This is definitely a better system for dealing with markdown, and it will make it easier for us to add more things in the future should the need arise. I've come to realize, however, that for the greatest extensibility and generality I probably should have gone with a more formal approach akin to a markdown compiler. When I started this I didn't know anything about how compilers work, but if I were to do it again I would implement a relatively simple compiler with three phases: lexical analysis, syntax analysis (parsing), and code generation. Lexical analysis would be similar to what we have now in that it builds a list of tokens. Syntax analysis would take these tokens and create a parse tree with appropriate nodes (God, I long for a nice type system). Then we would just walk the tree and print out the appropriate string. I believe this approach could represent everything you would ever want in Markdown. Hopefully, with the restructuring of hubski, I won't need to.
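For the curious, the three phases (token list, tree, string output) might be sketched like this. It is Python, covers only three inline markers, and every name in it (lex, parse, generate, Text, Styled, compile_markdown) is made up for the illustration; it is not something we have built.

```python
from dataclasses import dataclass, field

# Hypothetical marker-to-tag table.
TAGS = {"*": "i", "+": "b", "~": "s"}

@dataclass
class Text:
    value: str

@dataclass
class Styled:
    tag: str
    children: list = field(default_factory=list)

def lex(s):
    """Phase 1: split the input into marker tokens and text tokens."""
    tokens, buf = [], ""
    for ch in s:
        if ch in TAGS:
            if buf:
                tokens.append(buf)
                buf = ""
            tokens.append(ch)
        else:
            buf += ch
    if buf:
        tokens.append(buf)
    return tokens

def parse(tokens, pos=0, closer=None):
    """Phase 2: build a tree. Returns (nodes, next_pos)."""
    nodes = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == closer:
            return nodes, pos + 1  # consume the closing marker
        if tok in TAGS and tok in tokens[pos + 1:]:
            # Recurse so that nested markers become child nodes.
            children, pos = parse(tokens, pos + 1, closer=tok)
            nodes.append(Styled(TAGS[tok], children))
        else:
            nodes.append(Text(tok))  # text, or a marker with no closer
            pos += 1
    return nodes, pos

def generate(nodes):
    """Phase 3: walk the tree and emit html."""
    parts = []
    for node in nodes:
        if isinstance(node, Text):
            parts.append(node.value)
        else:
            parts.append(f"<{node.tag}>{generate(node.children)}</{node.tag}>")
    return "".join(parts)

def compile_markdown(s):
    tree, _ = parse(lex(s))
    return generate(tree)

print(compile_markdown("+bold *and italic*+ plain"))
```

The appeal of the tree is that adding a new construct means adding a node type and a case in the generator, rather than threading another flag through one big loop.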