Rich-Text in Telegram: A Guide to Converting to Markdown and Understanding the Logic Behind Markup Languages

2023-02-05

bot logo
caption: project logo

In Telegram, you can format your messages as visually appealing rich-text. But how can you convert them back to markdown? Telegram Channel editors may find it difficult to switch between Telegram and markdown editors. To address this, a bot has been created to translate rich-text to markdown. In this guide, we will explore how this conversion can be achieved and delve into the underlying logic of markup languages.

Three Types of Rich-Text in Telegram

Exploring the differences between rich-text rendered by clients, TL-schema entities, and transferred messages

When we talk about Telegram’s rich-text, we can easily mix up 3 different concepts:

  1. the rich-text rendered by clients. It is usually platform-specific, such as Qt and macOS;
  2. the rich-text entity defined by TL-schema. It is used within Telegram’s internal API;
  3. and the rich-text message transferred over (encrypted) plain text-based protocols like HTTP(S). It is usually used for the Telegram Bot API.

In this post, we will focus on the third type of data due to its accessibility, operability, and universality.

typewriter-tg-edit
caption: telegram logo on typewriter

Telegram Rich-Text Formatting: Understanding the JSON Request

Breaking down the structure of Telegram’s rich-text messages in Bot API

Typically, a user sends a rich-text message to a Telegram bot:

I can eat glass; It does not hurt me

The bot will receive this message as a JSON request. For example:

{
  "update_id": 12345678,
  "message": {
    "message_id": 12345678,
    "from": {
      "id": 12345678,
      "is_bot": false,
      "first_name": "example",
      "last_name": "example",
      "username": "cxumol",
      "language_code": "example"
    },
    "chat": {
      "id": 12345678,
      "first_name": "example",
      "last_name": "example",
      "username": "cxumol",
      "type": "private"
    },
    "date": 12345678,
    "text": "I can eat glass; It does not hurt me",
    "entities": [
      {
        "offset": 2,
        "length": 13,
        "type": "code"
      },
      {
        "offset": 20,
        "length": 5,
        "type": "italic"
      },
      {
        "offset": 20,
        "length": 4,
        "type": "bold"
      },
      {
        "offset": 25,
        "length": 4,
        "type": "italic"
      },
      {
        "offset": 25,
        "length": 4,
        "type": "strikethrough"
      },
      {
        "offset": 29,
        "length": 7,
        "type": "bold"
      },
      {
        "offset": 29,
        "length": 7,
        "type": "italic"
      },
      {
        "offset": 29,
        "length": 7,
        "type": "strikethrough"
      }
    ]
  }
}

Here, we can identify two relevant fields: "text" and "entities". The "text" field contains just the plain text, and each item in the "entities" field defines 3 key-value pairs for a piece of text to render: "offset" (the starting position), "length" (the range), and "type" (the style name).
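As a minimal sketch of how "offset" and "length" map onto "text", the snippet below slices out the substring each entity covers. The helper name describeEntities is hypothetical, and only three entities from the example above are used. It relies on the Bot API counting offsets in UTF-16 code units, which is also how JavaScript indexes strings.

```javascript
// Data taken from the JSON example above (a subset of its entities).
const text = "I can eat glass; It does not hurt me";
const entities = [
  { offset: 2, length: 13, type: "code" },
  { offset: 20, length: 4, type: "bold" },
  { offset: 29, length: 7, type: "strikethrough" },
];

// Hypothetical helper: pair each style name with the substring it covers.
function describeEntities(text, entities) {
  return entities.map(({ offset, length, type }) => ({
    type,
    substring: text.slice(offset, offset + length),
  }));
}

console.log(describeEntities(text, entities));
// code -> "can eat glass", bold -> "does", strikethrough -> "hurt me"
```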

Such a structure is fairly straightforward for Telegram clients to render, but it is harder to reassemble "text" and "entities" into markdown, because the markups have to be nested properly. You don’t want to end up with **[bold** link](https://example.com), right?

A Sorting Algorithm for HTML tags and Markdown Markups

Examining the strategy provided by @CodeHz for sorting HTML-tags/markdown-markups in the correct position

In this case, we need a sorting algorithm for positioning our markdown markups. Special thanks to @CodeHz for his great idea. He generously provided the following sorting strategy: first sort by starting position, and when two entities share the same starting position, the one with the later ending position (i.e. the longer one) comes first.

image
caption: the illustration of @CodeHz sorting strategy

If we insert HTML tags or Markdown marks one by one in this sorted order, it will result in the correct placement of styles in the final string.

Now let’s consider implementing this sorting algorithm. (Server-side) JavaScript can be a good choice given the growing popularity of FaaS (Function-as-a-Service). A typical ascending numerical sort in JavaScript is written as myArray.sort((a, b) => a - b);. In this particular case, the sort becomes entities.sort((a, b) => a.offset - b.offset || b.length - a.length).
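To make the comparator concrete, here is a small sketch with made-up entities (the sample data and the sortEntities wrapper are assumptions for illustration). When the offsets are equal, the first subtraction yields 0, so the || falls through to the length comparison and the longer entity wins.

```javascript
// Hypothetical wrapper around the comparator quoted above.
// It copies the array first so the caller's data is left untouched.
function sortEntities(entities) {
  return [...entities].sort(
    (a, b) => a.offset - b.offset || b.length - a.length,
  );
}

// Made-up sample: two entities start at 0, one starts at 5.
const sample = [
  { offset: 5, length: 3, type: "italic" },
  { offset: 0, length: 4, type: "bold" },
  { offset: 0, length: 10, type: "italic" },
];

console.log(sortEntities(sample).map((e) => `${e.offset}:${e.length}`));
// [ '0:10', '0:4', '5:3' ]
```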

Pretty concise and clear. However, the tie-breaker may not matter much in practice, as Telegram splits styled entities into the smallest possible segments (at least in the current API version). As shown in the schematic, when Telegram parses the styles, overlapping "entities" always share the same starting and ending positions.

understand telegram text style segmentation
caption: understand how telegram segments rich-text

A Practical Solution for Wrapping Markups

A deeper look at markups and string manipulation in markdown and JavaScript

The solution is nearly within reach: simply insert markup symbols at the calculated position according to data provided by the Telegram Bot API. However, there are still two major issues to resolve.

The first issue stems from what was just mentioned: the segmentation strategy ends up producing highly fragmented "entities". Unlike HTML and XML, markdown does not distinguish between starting and ending tags (markups). This can trouble some markdown editors, causing translated markdown strings such as **bold****_bolditalic_**plain to be parsed in a glitchy manner.

obsidian-err-parse
caption: glitchy parsed markdown in Obsidian.md 1.1.9

To overcome this issue, one orthodox solution is to merge adjacent same-style markups into one so that the @CodeHz sorting strategy can be better utilized. However, it’s not always necessary because simply removing ****, __ and ~~~~ from the translated markdown string can produce the same result.
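The shortcut of removing back-to-back markers could look roughly like the sketch below. The function name collapseAdjacentMarkers is hypothetical, and the sketch assumes that any literal *, _ or ~ in the message text has already been escaped, so that ****, __ and ~~~~ can only come from one style closing and immediately reopening. It repeats until nothing changes, because removing one pair can bring another pair together.

```javascript
// Hypothetical helper: drop a closing marker that is immediately followed
// by an opening marker of the same style.
// Assumes literal *, _ and ~ characters in the text were escaped earlier.
function collapseAdjacentMarkers(markdown) {
  const emptyPairs = ["****", "~~~~", "__"];
  let previous;
  do {
    previous = markdown;
    for (const pair of emptyPairs) {
      markdown = markdown.split(pair).join("");
    }
  } while (markdown !== previous);
  return markdown;
}

console.log(collapseAdjacentMarkers("**bold****_bolditalic_**plain"));
// **bold_bolditalic_**plain
```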

The other issue is more traditional. In the “easy” mode where styles never overlap, translating rich-text to markdown can be achieved through splitting, wrapping, and joining. However, if styles overlap, this 3-step solution becomes invalid because multiple markups could become mixed after a single wrap.

when text styles never overlap
caption: when text styles never overlap, directly wrapping on substrings can work well
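For the non-overlapping case, the split-wrap-join idea can be sketched as follows. The MARKERS table and the name wrapNonOverlapping are assumptions made for this example, not taken from the bot’s source, and the sketch does no escaping.

```javascript
// Assumed mapping from Telegram style names to markdown markers.
const MARKERS = { bold: "**", italic: "_", strikethrough: "~~", code: "`" };

// Hypothetical helper for the "easy" mode: entities must not overlap.
function wrapNonOverlapping(text, entities) {
  const sorted = [...entities].sort((a, b) => a.offset - b.offset);
  const parts = [];
  let cursor = 0;
  for (const { offset, length, type } of sorted) {
    const mark = MARKERS[type] ?? ""; // unknown styles pass through unwrapped
    parts.push(text.slice(cursor, offset)); // split: plain text before the entity
    parts.push(mark + text.slice(offset, offset + length) + mark); // wrap
    cursor = offset + length;
  }
  parts.push(text.slice(cursor)); // trailing plain text
  return parts.join(""); // join
}

console.log(
  wrapNonOverlapping("plain bold italic", [
    { offset: 6, length: 4, type: "bold" },
    { offset: 11, length: 6, type: "italic" },
  ]),
);
// plain **bold** _italic_
```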

At the algorithmic level, how can we ensure that inserted markups are not in disorder? Intuitively, we can repeat the split-wrap-join steps recursively, but this approach would be effective only with Telegram’s current segmentation strategy, in which markups with overlapping styles always have the same starting and ending positions. Here, I aim to develop a more universal algorithm that will remain functional even if Telegram (or a related library) changes its tokenization strategy to the longest possible range in the future.

when text styles overlap
caption: when text styles overlap, this is the algorithm we need

To clarify the illustration, the counterparts to the blue ** and green _ have been omitted as it is assumed that readers are already familiar with markdown. The emphasis is placed on the fact that “ending markups” should go to their “ending slot” and “starting markups” should go to their “starting slot” in the correct order.

Now, we can examine the algorithm illustrated above in a practical manner. Remember that "entities" is the name of the field in Telegram’s JSON API that describes the style type and position. Instead of directly wrapping extracted substrings, the program should identify possible insertion points (i.e. gaps between "entities" substrings), and divide each of these positions into two slots (implemented as arrays in JavaScript): one for ending markups on the left and one for starting markups on the right. As the program loops through the sorted "entities", every new “ending markup” from the entity list is appended to the tail (right-side) of its corresponding slot, and every new “starting markup” is added to the head (left-side) of its corresponding slot. Once these steps are done, the final converted string can be created by joining the split substrings and inserting the filled-up markup slots in between.
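The slot idea can be sketched as below. This is an illustration under stated assumptions, not the bot’s actual code: the MARKERS table and the name entitiesToMarkdown are hypothetical, no escaping is done, and entities are assumed to be properly nested (never partially overlapping). One more assumption matters for correctness: for the head/tail rule to nest markers properly, this sketch walks the sorted list from its last item to its first, so that inner entities fill their slots before outer ones.

```javascript
// Assumed mapping from Telegram style names to markdown markers.
const MARKERS = { bold: "**", italic: "_", strikethrough: "~~", code: "`" };

// Hypothetical sketch of the two-slot insertion described above.
function entitiesToMarkdown(text, entities) {
  const sorted = entities
    .filter((e) => Object.hasOwn(MARKERS, e.type))
    .sort((a, b) => a.offset - b.offset || b.length - a.length);

  // Each insertion point has two slots: ending markers (left side)
  // and starting markers (right side).
  const ending = new Map();
  const starting = new Map();
  const slot = (map, pos) => {
    if (!map.has(pos)) map.set(pos, []);
    return map.get(pos);
  };

  // Assumption of this sketch: visit inner entities first (reverse order).
  for (let i = sorted.length - 1; i >= 0; i--) {
    const { offset, length, type } = sorted[i];
    slot(starting, offset).unshift(MARKERS[type]); // head of the starting slot
    slot(ending, offset + length).push(MARKERS[type]); // tail of the ending slot
  }

  // Split the text at every insertion point, then join with the slots between.
  const cuts = [
    ...new Set([0, text.length, ...starting.keys(), ...ending.keys()]),
  ].sort((a, b) => a - b);
  let out = "";
  for (let i = 0; i < cuts.length; i++) {
    const pos = cuts[i];
    out += (ending.get(pos) ?? []).join("");
    out += (starting.get(pos) ?? []).join("");
    if (i + 1 < cuts.length) out += text.slice(pos, cuts[i + 1]);
  }
  return out;
}

console.log(
  entitiesToMarkdown("hello world", [
    { offset: 0, length: 11, type: "bold" },
    { offset: 6, length: 5, type: "italic" },
  ]),
);
// **hello _world_**
```

The same function also handles Telegram’s current segmentation, where overlapping entities share both ends: bold and italic over the same five characters come out as **_hello_**.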

That concludes the explanation of the critical algorithms. Additionally, there are numerous edge cases related to markdown that need to be addressed, such as escaping [ in link text, and encoding special characters in URLs. Special thanks to @yzqzss and @gledos for their ingenious suggestions about edge cases in markdown.
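As one possible way to handle the two edge cases named above, the sketch below escapes brackets in the link text and percent-encodes a few characters in the URL that would otherwise end the link early. The function name toMarkdownLink and the exact set of characters handled are assumptions for illustration; the bot may treat these cases differently.

```javascript
// Hypothetical helper: build a markdown link that survives brackets in the
// label and parentheses or spaces in the URL.
function toMarkdownLink(label, url) {
  // Backslash-escape \, [ and ] so they cannot close the link text early.
  const safeLabel = label.replace(/[\\\[\]]/g, "\\$&");
  // Percent-encode (, ) and ASCII spaces so they cannot end the destination.
  const safeUrl = url.replace(
    /[() ]/g,
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase(),
  );
  return `[${safeLabel}](${safeUrl})`;
}

console.log(toMarkdownLink("a [b]", "https://example.com/x (1)"));
// [a \[b\]](https://example.com/x%20%281%29)
```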

telegram bot preview
caption: Bot Preview

As of the latest version, some edge cases and markdown features are still handled imperfectly. Here is an incomplete list:

  • | won’t be escaped, as tables are not defined in Telegram rich-text. However, some markdown editors will treat | as part of a markdown table.
  • Unordered lists starting with * (loved by org-mode writers) won’t work.

In terms of framework and serving platform, I decided on grammY and Deno Deploy due to their flexibility, cost-effectiveness and ease of serverless function deployment.

Learn more: Visit the GitHub Repository

If you would like to try the demo, read the full source code, or share ideas and suggestions for this project, please don’t hesitate to visit https://github.com/cxumol/tg2md-bot. A star, a contribution to the project, or any other form of support would be a great encouragement to the developer.

Answers to the previous post’s homework

  1. Because Mozilla Firefox does not support Intl.Segmenter. If something is not supported on Firefox, it cannot be considered a web browser feature.
  2. Use Denoflare

The answers above are for reference only.

Homework for this post

My dear readers, here is your homework:

  1. Why not pandoc?
  2. How would you evaluate an alternative solution: converting to HTML and then converting to markdown?
  3. Is the method of representing rich text using "text" and "entities" unique to Telegram, or is it present in other software as well?
  4. I can’t see how such a simple conversion function deserves such a long article! It doesn’t even involve any profound/sophisticated algorithms/data structures/mathematics. How useful is it for revealing tons of messy, redundant, low-tech details?

You are welcome to post your answers in the comments below. My answers will be given in the next post.



