caption: project logo
In Telegram, you can format your messages into visually appealing rich-text. But how can you convert them back to markdown? Telegram channel editors may find it difficult to switch between Telegram and markdown editors. To resolve this challenge, a bot has been created to translate rich-text to markdown. In this guide, we will explore how this conversion can be achieved and delve into the underlying logic of markup languages.
Three Types of Rich-Text in Telegram
Exploring the differences between rich-text rendered by clients, TL-schema entities, and transferred messages
When we talk about Telegram’s rich-text, it is easy to mix up three different concepts:
- the rich-text rendered by clients, which is usually platform-specific (for example, Qt or macOS);
- the rich-text entity defined by the TL-schema, which is used within Telegram’s internal API;
- and the rich-text message transferred over (encrypted) plain text-based protocols like HTTP(S), which is what the Telegram Bot API usually uses.
In this post, we will focus on the third type of data due to its accessibility, operability, and universality.
caption: telegram logo on typewriter
Telegram Rich-Text Formatting: Understanding the JSON Request
Breaking down the structure of Telegram’s rich-text messages in Bot API
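Typically, suppose a user sends the following rich-text message to a Telegram bot, where “can eat glass” is inline code, “does” is bold italic, “not” is italic strikethrough, and “hurt me” is bold italic strikethrough:
I can eat glass; It does not hurt me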
The bot will receive this message as a JSON request. For example:
{
  "update_id": 12345678,
  "message": {
    "message_id": 12345678,
    "from": {
      "id": 12345678,
      "is_bot": false,
      "first_name": "example",
      "last_name": "example",
      "username": "cxumol",
      "language_code": "example"
    },
    "chat": {
      "id": 12345678,
      "first_name": "example",
      "last_name": "example",
      "username": "cxumol",
      "type": "private"
    },
    "date": 12345678,
    "text": "I can eat glass; It does not hurt me",
    "entities": [
      { "offset": 2, "length": 13, "type": "code" },
      { "offset": 20, "length": 5, "type": "italic" },
      { "offset": 20, "length": 4, "type": "bold" },
      { "offset": 25, "length": 4, "type": "italic" },
      { "offset": 25, "length": 4, "type": "strikethrough" },
      { "offset": 29, "length": 7, "type": "bold" },
      { "offset": 29, "length": 7, "type": "italic" },
      { "offset": 29, "length": 7, "type": "strikethrough" }
    ]
  }
}
Here, we can identify two relevant fields: "text" and "entities". The "text" field contains just the plain text, and each item in the "entities" field defines three key-value pairs for a piece of text to render: "offset" (the starting position), "length" (the range), and "type" (the style name).
Such a structure is fairly straightforward for Telegram clients to render, but it makes it harder to restructure "text" and "entities" into markdown. After all, you don’t want to end up with **[bold** link](https://example.com), right?
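To make the "offset" and "length" pairs concrete, here is a minimal sketch that lists the substring each entity covers. The helper name describeEntities is made up for illustration, and the sample uses only three of the entities from the request above. It relies on the fact that the Bot API counts offsets and lengths in UTF-16 code units, which is also how JavaScript indexes strings.

```javascript
// Hypothetical helper (not from the bot's source): list what each entity covers.
// Bot API offsets/lengths are UTF-16 code units, the same unit JavaScript
// strings use, so slice() can be applied directly.
function describeEntities(text, entities) {
  return entities.map(({ offset, length, type }) => ({
    type,
    covered: text.slice(offset, offset + length),
  }));
}

const sampleText = "I can eat glass; It does not hurt me";
const sampleEntities = [
  { offset: 2, length: 13, type: "code" },
  { offset: 20, length: 4, type: "bold" },
  { offset: 29, length: 7, type: "strikethrough" },
];

// code covers "can eat glass", bold covers "does", strikethrough covers "hurt me"
console.log(describeEntities(sampleText, sampleEntities));
```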
A Sorting Algorithm for HTML tags and Markdown Markups
Examining the strategy provided by @CodeHz for sorting HTML-tags/markdown-markups in the correct position
In this case, we need a sorting algorithm for positioning our markdown markups. Special thanks to @CodeHz for his great idea. He generously provided the following sorting strategy: first sort by starting position, and when two entities share the same starting position, the one with the later ending position (i.e. the longest) should come first.
caption: the illustration of @CodeHz sorting strategy
If we insert HTML tags or Markdown marks one by one in this sorted order, it will result in the correct placement of styles in the final string.
Now let’s consider implementing this sorting algorithm. (Server-side) JavaScript is a good choice given the growing popularity of FaaS (Function-as-a-Service). A typical ascending numerical sort in JavaScript is written as myArray.sort((a, b) => a - b). In this particular case, the sort becomes entities.sort((a, b) => a.offset - b.offset || b.length - a.length).
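As a quick check of that comparator, here is a small sketch run on made-up entities (they are illustrative and not taken from a real update). The name byStartThenLongest is only a label chosen for this example.

```javascript
// Comparator from the text: ascending start offset; on a tie, the longer entity first.
const byStartThenLongest = (a, b) =>
  a.offset - b.offset || b.length - a.length;

// Illustrative entities: a link spanning 0..9 that contains a shorter bold
// run starting at the same offset, followed by an italic run.
const unsorted = [
  { offset: 5, length: 4, type: "italic" },
  { offset: 0, length: 4, type: "bold" },
  { offset: 0, length: 9, type: "text_link" },
];

// Copy before sorting, because Array.prototype.sort sorts in place.
const sorted = [...unsorted].sort(byStartThenLongest);

// Resulting order of types: text_link, bold, italic
console.log(sorted.map((e) => e.type).join(", "));
```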
Pretty concise and clear. However, the length tie-breaker may rarely matter in practice, because Telegram splits styled entities into the smallest possible segments (at least in the current API version). As shown in the schematic, when Telegram parses the styles, overlapping "entities" always share the same starting and ending positions.
caption: understand how telegram segments rich-text
A Practical Solution for Wrapping Markups
A deeper look at markups and string manipulation in markdown and JavaScript
The solution is nearly within reach: simply insert markup symbols at the calculated position according to data provided by the Telegram Bot API. However, there are still two major issues to resolve.
The first issue stems from what was just mentioned: the segmentation strategy produces highly fragmented "entities". Unlike HTML and XML, markdown does not distinguish between starting and ending tags (markups). This can trouble some markdown editors, causing translated markdown strings such as **bold****_bolditalic_**plain to be parsed in a glitchy manner.
caption: glitchy parsed markdown in Obsidian.md 1.1.9
To overcome this issue, one orthodox solution is to merge adjacent same-style markups into one so that the @CodeHz sorting strategy can be better utilized. However, this is not always necessary, because simply removing ****, __ and ~~~~ from the translated markdown string can produce the same result.
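The shortcut of removing back-to-back markups can be sketched with a single regular expression. This is only an illustration under one assumption: literal underscores, asterisks and tildes in the message text have already been escaped, so any remaining ****, __ or ~~~~ is a closing markup immediately followed by an opening markup of the same style. The function name dropEmptyMarkupPairs is made up for this example.

```javascript
// Hypothetical cleanup step: delete a closing markup that is immediately
// followed by an opening markup of the same style.
// Assumes literal *, _ and ~ in the text were escaped beforehand.
function dropEmptyMarkupPairs(markdown) {
  return markdown.replace(/\*{4}|~{4}|__/g, "");
}

// The glitchy string from above becomes one continuous bold run.
console.log(dropEmptyMarkupPairs("**bold****_bolditalic_**plain"));
// prints: **bold_bolditalic_**plain
```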
The other issue is more traditional. In the “easy” mode where styles never overlap, translating rich-text to markdown can be achieved through splitting, wrapping, and joining. However, if styles overlap, this 3-step solution becomes invalid because multiple markups could become mixed after a single wrap.
caption: when text styles never overlap, directly wrapping on substrings can work well
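For the “easy” mode, the split-wrap-join steps can be sketched as follows. The MARKS table and the function name wrapNonOverlapping are assumptions made for this example, only a few style types are mapped, and the sketch is only valid when no two entities overlap.

```javascript
// Illustrative mapping from entity type to markdown markup (an assumption,
// covering only a few styles).
const MARKS = { bold: "**", italic: "_", strikethrough: "~~", code: "`" };

// Split the text at entity boundaries, wrap each styled piece, join again.
// Only valid when entities never overlap.
function wrapNonOverlapping(text, entities) {
  const sorted = [...entities].sort((a, b) => a.offset - b.offset);
  const parts = [];
  let cursor = 0;
  for (const { offset, length, type } of sorted) {
    const mark = MARKS[type] ?? ""; // unknown types stay unstyled
    parts.push(text.slice(cursor, offset)); // plain text before the entity
    parts.push(mark + text.slice(offset, offset + length) + mark);
    cursor = offset + length;
  }
  parts.push(text.slice(cursor)); // plain text after the last entity
  return parts.join("");
}

console.log(
  wrapNonOverlapping("I can eat glass", [
    { offset: 2, length: 3, type: "bold" },
    { offset: 10, length: 5, type: "italic" },
  ])
);
// prints: I **can** eat _glass_
```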
At the algorithmic level, how can we ensure that inserted markups are not in disorder? Intuitively, we can repeat the split-wrap-join steps recursively, but this approach would be effective only with Telegram’s current segmentation strategy, in which markups with overlapping styles always have the same starting and ending positions. Here, I aim to develop a more universal algorithm that will remain functional even if Telegram (or a related library) changes its tokenization strategy to the longest possible range in the future.
caption: when text styles never overlap, that would be the algorithm we need
To keep the illustration clear, the counterparts to the blue ** and green _ have been omitted, as it is assumed that readers are already familiar with markdown. The emphasis is on the fact that “ending markups” should go to their “ending slot” and “starting markups” should go to their “starting slot”, in the correct order.
Now, we can examine the algorithm illustrated above in practical terms. Recall that "entities" is the name of the field in Telegram’s JSON API that describes the style type and position. Instead of directly wrapping extracted substrings, the program should identify possible insertion points (i.e. gaps between "entities" substrings), and divide each of these positions into two slots (implemented as Array in JavaScript): one for ending markups on the left and one for starting markups on the right. As the program loops through the sorted "entities", every new “ending markup” from the entity list is appended to the tail (right side) of its corresponding slot, and every new “starting markup” is added to the head (left side) of its corresponding slot. Once these steps are done, the final converted string can be created by joining the split substrings and inserting the filled-up markup slots in between.
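The slot idea can be sketched as below. This is not the bot’s actual source: the MARKS table and the name entitiesToMarkdown are made up for illustration. The sketch also assumes the entities are visited innermost-first (later start first; on a tie, the shorter one first), since that is the visiting order under which “ending markups to the tail, starting markups to the head” produces properly nested output. Entities that partially cross each other are not handled.

```javascript
// Illustrative mapping from entity type to markdown markup (an assumption).
const MARKS = { bold: "**", italic: "_", strikethrough: "~~", code: "`" };

function entitiesToMarkdown(text, entities) {
  // Assumed visiting order: innermost entity first.
  const innerFirst = [...entities].sort(
    (a, b) => b.offset - a.offset || a.length - b.length
  );

  // One pair of slots per insertion point: position -> { closing, opening }.
  const slots = new Map();
  const slotAt = (pos) => {
    if (!slots.has(pos)) slots.set(pos, { closing: [], opening: [] });
    return slots.get(pos);
  };

  for (const { offset, length, type } of innerFirst) {
    const mark = MARKS[type];
    if (!mark) continue; // unsupported types stay plain in this sketch
    slotAt(offset + length).closing.push(mark); // ending markup: tail
    slotAt(offset).opening.unshift(mark); // starting markup: head
  }

  // Join the substrings, emitting closing slots before opening slots.
  const positions = [...slots.keys()].sort((a, b) => a - b);
  let out = "";
  let cursor = 0;
  for (const pos of positions) {
    const { closing, opening } = slots.get(pos);
    out += text.slice(cursor, pos) + closing.join("") + opening.join("");
    cursor = pos;
  }
  return out + text.slice(cursor);
}

console.log(
  entitiesToMarkdown("hello brave new world", [
    { offset: 0, length: 21, type: "bold" },
    { offset: 6, length: 9, type: "italic" },
    { offset: 12, length: 3, type: "strikethrough" },
  ])
);
// prints: **hello _brave ~~new~~_ world**
```

Because the slots are keyed by position rather than by entity, the same code works whether overlapping styles share identical ranges (Telegram’s current segmentation) or are reported as longest possible ranges nested inside one another.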
That concludes the explanation of the critical algorithms. Additionally, there are numerous markdown-related edge cases that need to be addressed, such as escaping [ in link text and encoding special characters in URLs. Special thanks to @yzqzss and @gledos for their ingenious suggestions about edge cases in markdown.
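As a rough illustration of those two edge cases, the sketch below escapes brackets in link text and percent-encodes parentheses and spaces in the URL. The helper name toMarkdownLink and the exact set of characters handled are assumptions for this example; the bot’s real escaping rules may differ.

```javascript
// Hypothetical helper: build a markdown link that survives brackets in the
// label and parentheses or spaces in the URL.
function toMarkdownLink(label, url) {
  // Backslash-escape \, [ and ] so they cannot end the link text early.
  const safeLabel = label.replace(/[\\\[\]]/g, "\\$&");
  // Percent-encode ( ) and ASCII space by hand; encodeURIComponent leaves
  // parentheses untouched, so it cannot be used for this.
  const safeUrl = url.replace(
    /[() ]/g,
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
  return `[${safeLabel}](${safeUrl})`;
}

console.log(toMarkdownLink("a [b] c", "https://example.com/x_(y)"));
// prints: [a \[b\] c](https://example.com/x_%28y%29)
```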
caption: Bot Preview
As of the latest version, there are still some imperfections in edge cases and markdown features. Here is an incomplete list of those imperfections:
- | won’t be escaped, as tables are not defined in Telegram rich-text; however, some markdown editors will treat | as part of a markdown table.
- Unordered lists starting with * (loved by org-mode writers) won’t work
In terms of framework and serving platform, I decided on grammY and Deno Deploy due to their flexibility, cost-effectiveness and ease of serverless function deployment.
Learn more: Visit the GitHub Repository
If you would like to try the demo, read the full source code, or have any ideas/suggestions for this project, please don’t hesitate to visit https://github.com/cxumol/tg2md-bot. The developer will be encouraged if you can kindly give a star, contribute to the project, or support the developer in any way.
Answers to the previous post’s homework
- Because Mozilla Firefox does not support Intl.Segmenter. If something is not supported on Firefox, it cannot be considered a web browser feature.
- Use Denoflare
The answers above are for reference only.
Homework for this post
My dear readers, here is your homework:
- Why not pandoc?
- How would you evaluate an alternative solution: converting to HTML and then converting to markdown?
- Is the method of representing rich text using "text" and "entities" unique to Telegram, or is it present in other software as well?
- I can’t see how such a simple conversion function deserves such a long article! It doesn’t even involve any profound/sophisticated algorithms/data structures/mathematics. How useful is it for revealing tons of messy, redundant, low-tech details?
You are welcome to leave your answers in the comments below. My answers will be given in the next post.