TLDR
For impatient readers, here is the link to jump to the "talk is cheap, show me the code" section.
Or read the source code directly and try out the online demo.
Chinese text segmentation
Unlike many writing systems, Chinese text does not separate words with white space. Instead, words run together in a continuous sequence, and white space is only used around punctuation marks.
Under these circumstances, text segmentation is important. Lacking an implementation of text segmentation, Telegram search is terrible for Chinese (and Japanese, Thai, etc.) users, and mdBook becomes less appealing for Chinese e-book authors and readers.
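To make the problem concrete, here is a minimal sketch in JavaScript. The sample sentences and the helper name splitOnWhitespace are my own illustration and do not come from any of the tools mentioned here. Splitting on white space gives usable tokens for English, but returns a Chinese sentence as one unsegmented chunk.

```javascript
// Naive tokenizer (hypothetical helper): split on runs of white space.
function splitOnWhitespace(text) {
  return text.trim().split(/\s+/);
}

const english = splitOnWhitespace("I came to Tsinghua University in Beijing");
const chinese = splitOnWhitespace("我来到北京清华大学");

console.log(english.length); // 7 tokens
console.log(chinese.length); // 1 token: the whole sentence
```

A dictionary- or model-based segmenter is needed to break the second sentence into words.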
Jieba on Python, the good and the bad
Jieba is a Python library for Chinese text segmentation, and probably the best choice for a lightweight NLP (natural language processing) service. It is easy to install and easy to use, but it still comes at a price when used as part of a web application.
The first cost is accessibility. Jieba is a Python library. Although Python seems ubiquitous on the internet, the majority of online Python environments are Jupyter notebooks. Free web hosting services for Python applications are not widely available (e.g. not available in Firebase) and not well supported (e.g. not well supported in Vercel).
In contrast, JavaScript (or even Node.js) is the only ubiquitous programming language on the web. Browsers are the world’s most accessible code-executing environment. Browsers aside, even if you dislike NPM (Node.js Package Manager) like me, you can still create, build and deploy a Node.js web application not only free of charge but also free of NPM. JS-related web hosting is so popular that you can always find more than 3 alternative free online services to create / host JS applications.
I can create a Node.js-backed website on <codesandbox.io> and easily deploy it to Vercel. But how can I publish a Sanic-based website if I can’t afford a server or any paid online service? In case some readers are not aware, <pythonanywhere.com> does not support asyncio-based web frameworks like Sanic.
The second cost is speed. It’s well known that Python is slow, even when compared with many other scripting languages. Since text segmentation is basically a CPU-bound task, processing speed is a real concern. Thankfully, the algorithm of Jieba was rewritten in C++ and Rust, and those implementations perform amazingly in benchmarks.
To wrap up, Jieba on Python is good overall, but not good enough for web applications. For web applications, we want the Jieba library to be as widely available and well supported as JavaScript, and as fast and efficient as C++ and Rust. The answer, then, is now in sight.
Use Jieba.wasm
The answer is WebAssembly. However, there is only one Jieba WebAssembly module available online, and it is only available on NPM. The problem is that NPM is a killer for computer hard disks. I tend not to use NPM unless I have no choice, or until the day I no longer worry about wearing out my precious hard disk.
Is there a way to use the Jieba WebAssembly module on NPM without installing NPM? <cdn.skypack.dev> claims that their service can convert NPM packages into browser- and Deno-compatible formats. But it doesn’t work at this time, and I will explain why later.
Since the Jieba WebAssembly module is only distributed on NPM, we have to download the package manually from the npmjs API, so that we can get the essential files of the compiled wasm binary.
On Deno:
- Get the download link and download the npm package. On the npmjs API, the 0.0.2 package is at https://registry.npmjs.org/jieba-wasm/-/jieba-wasm-0.0.2.tgz
- Unpack the npm package with 7-Zip or tar -xzf, and then extract the folder jieba-wasm-0.0.2.tgz\package\pkg\web\
- Take the 2 files, .js and .wasm, from jieba-wasm-0.0.2.tgz\package\pkg\web\
- Below is the code on Deno to try out the Jieba WebAssembly module.
import init, { cut } from "./my/path/to/jieba_rs_wasm.js";
await init(); // make sure the .wasm file sits in the same folder as the imported .js file
console.log(await cut("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作", true));
- Save the file as main.js and run: deno --unstable run -A main.js
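The download link in the first step follows a pattern that can be sketched as a small helper. The function name tarballUrl is made up for this illustration, and the pattern is inferred only from the single URL quoted above, so treat it as an assumption that is expected to hold for unscoped packages such as jieba-wasm.

```javascript
// Hypothetical helper: build the registry tarball URL for an unscoped npm package.
// The pattern is inferred from the jieba-wasm 0.0.2 link quoted in the steps above.
function tarballUrl(name, version) {
  return `https://registry.npmjs.org/${name}/-/${name}-${version}.tgz`;
}

console.log(tarballUrl("jieba-wasm", "0.0.2"));
// https://registry.npmjs.org/jieba-wasm/-/jieba-wasm-0.0.2.tgz
```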
In the browser:
- Pretty much the same as on Deno, but instead of the import statement, use the import() function. Otherwise, the script tag in the HTML document has to be written as <script type="module"></script>, which isolates the script's namespace from the rest of the page as an ES module.
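<script>
// A classic (non-module) script cannot use top-level await,
// so the logic is wrapped in an async function.
async function main() {
  const { default: init, cut } = await import("./jieba_rs_wasm.js");
  await init();
  console.log(await cut("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作", true));
}
main();
</script>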
- Optionally, load the wasm from your favourite CDN
await init("https://cxumol.github.io/jieba-wasm-html/jieba_rs_wasm_bg.wasm");
Now let’s return to the <cdn.skypack.dev> issue. The Deno code looks like this:
import init, { cut } from "https://cdn.skypack.dev/jieba-wasm?dts";
await init();
console.log(await cut("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作", true));
and the error message looks like this:
>deno --unstable run -A jieba-wasm.deno.js
Download https://cdn.skypack.dev/-/jieba-wasm@v0.0.2-9shvs7DtdzO6wlyEKmSp/dist=es2019,mode=types/index.d.ts
error: Uncaught TypeError: Invalid URL
input = new URL("jieba_rs_wasm_bg.wasm", import_meta.url);
^
at Object.opSync (deno:core/01_core.js:170:12)
at opUrlParse (deno:ext/url/00_url.js:48:27)
at new URL (deno:ext/url/00_url.js:322:20)
at init (https://cdn.skypack.dev/-/jieba-wasm@v0.0.2-9shvs7DtdzO6wlyEKmSp/dist=es2019,mode=imports/optimized/jieba-wasm.js:260:13)
at file:///C:/code/jieba-wasm.deno.js:2:7
To fix this problem, we have to specify the wasm file path when initializing, like await init("https://cxumol.github.io/jieba-wasm-html/jieba_rs_wasm_bg.wasm"). Loading from the Skypack CDN then works well, on the condition that the wasm binary is downloaded from outside this CDN.
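The stack trace points at new URL("jieba_rs_wasm_bg.wasm", import_meta.url) inside init. A relative path can only be resolved against a valid base URL, so my guess (an assumption, since I did not dig into Skypack's build output) is that the base was missing or unusable in the optimized bundle. The sketch below reproduces the same class of error with the standard URL constructor. The helper name resolveWasm is hypothetical.

```javascript
// Hypothetical helper mirroring the line shown in the stack trace:
// resolve the wasm file name relative to the URL of the JS module.
function resolveWasm(moduleUrl) {
  return new URL("jieba_rs_wasm_bg.wasm", moduleUrl).href;
}

// With a valid base URL, the relative path resolves next to the JS file.
const ok = resolveWasm("https://cxumol.github.io/jieba-wasm-html/jieba_rs_wasm.js");
console.log(ok);

// Without a usable base, the URL constructor throws a TypeError.
let failure = null;
try {
  resolveWasm(undefined);
} catch (err) {
  failure = err;
}
console.log(failure instanceof TypeError); // true
```

Passing an absolute URL to init sidesteps the relative resolution step entirely, which is why the fix works.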
By the way, before I looked into “jieba-rs.wasm”, I also checked deno-jieba on GitHub. It calls methods from a shared library (natively compiled Rust Jieba), so it should perform better than its wasm counterpart. Ironically, “deno-jieba” is not really ready for use on Deno right now.
Benchmarks
I made a demo website, 结巴中文分词网页版 (Jieba Chinese word segmentation, web edition). It is a public service similar to 结巴分词演示 (Jieba segmentation demo) by GumbleX, which has a Python Jieba server on the backend.
These two websites can therefore be compared to evaluate the performance difference between “Jieba.py” and “Jieba.wasm”.
Feeding in a long text (钱锺书_灵感.txt, 77.5 KB) shows that our WebAssembly-based Jieba web application is about 45 times faster than a typical Python Jieba web application. (YMMV)
Result
- https://cxumol.github.io/jieba-wasm-html/ takes about 40 ms to finish a task, excluding initialization
- https://app.gumble.pw/jiebademo/ takes about 1.8 s to finish a task
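For anyone who wants to reproduce the browser-side number, here is a rough sketch of how a single task can be timed while keeping initialization out of the measurement. The helper timeTask and the stand-in segmenter fakeCut are hypothetical. On the real page you would pass a function that calls cut after init has already resolved. This is not necessarily how the figures above were measured.

```javascript
// Hypothetical helper: time one call of a (possibly async) task in milliseconds.
async function timeTask(task) {
  const start = performance.now();
  const result = await task();
  return { result, elapsedMs: performance.now() - start };
}

// Stand-in for a segmenter so the sketch runs without the wasm module.
// It splits text into single characters, which is NOT what Jieba does.
const fakeCut = (text) => Array.from(text);

const timing = timeTask(() => fakeCut("我来到北京清华大学"));
timing.then(({ result, elapsedMs }) => {
  console.log(`${result.length} tokens in ${elapsedMs.toFixed(2)} ms`);
});
```

A single run is noisy, so repeating the task a few times and comparing the results gives a more trustworthy figure.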
Screenshots
On https://cxumol.github.io/jieba-wasm-html/, initialization takes a relatively long time; apart from that, it’s super fast.
https://app.gumble.pw/jiebademo/ is laggy when text is long.
Server response time plays a major role in these tasks on https://app.gumble.pw/jiebademo/
Conclusion
WebAssembly Jieba works very well for processing Chinese text in web applications.
There are some obstacles for sure, such as the unergonomic module-importing mechanism in ECMAScript and the lack of ready-to-use wasm modules for browser-side scripts. But overall, it’s still worth implementing for the sake of performance.
Answers to homework questions from the previous post
- GPUs have been expensive and out of stock in recent years (update: but not now), which is blamed on crypto mining; GPU encoding generates larger video files by default. Ref 1 https://www.reddit.com/r/linux/comments/4wncug/hevc_nvenc_ffmpeg_huge_file_size/. Ref 2 https://unix.stackexchange.com/a/677315
- Remuxing differs from transcoding in that remuxing a file simply repackages the existing streams while transcoding actually creates new ones from a source. Ref https://support.plex.tv/articles/201097958-remuxing-files-to-mkv/
- RTFM or switch to presets starting w/o H265 and the explanation will show up
- Go to “Help + support” section on Azure and send a support request for help.
The answers above are for reference only.
Homework
My dear readers, here is your homework:
- Why not use Intl.Segmenter for text segmentation in the browser?
- I like Cloudflare Workers, but I don’t use Cloudflare wrangler because it depends on NPM. WebAssembly modules used to be accessible in the Cloudflare Workers web UI, but not anymore. In this situation, how can I use WebAssembly modules on Cloudflare Workers?
You are welcome to leave your answer in the comments below. My answer will be given in the next post.