StringZilla 🦖

The world wastes a minimum of $100M annually due to inefficient string operations. A typical codebase processes strings character by character, resulting in too many branches and data dependencies, and neglecting 90% of a modern CPU's potential. LibC is different. It attempts to leverage SIMD instructions to boost some operations, and is often used by higher-level languages, runtimes, and databases. But it isn't perfect.

1️⃣ First, even on common hardware, including over a billion 64-bit ARM CPUs, common functions like strstr and memmem only achieve 1/3 of the CPU's throughput.
2️⃣ Second, SIMD coverage is inconsistent: acceleration in forward scans does not guarantee speed in the reverse-order search.
3️⃣ Lastly, most high-level languages can't always use LibC, as their strings are often not null-terminated or may contain the Unicode "Zero" character in the middle of the string.

That's why StringZilla was created: to provide predictably high performance, portable to any modern platform, operating system, and programming language.
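The third limitation is easy to reproduce. The sketch below is only an illustration, not StringZilla code: `find_with_length` is a hypothetical, deliberately naive helper that takes explicit lengths, and it is contrasted with the standard `strstr`, which stops at the first zero byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration only (not part of StringZilla):
 * a naive substring search that takes explicit lengths, so zero bytes
 * inside the haystack are treated like any other byte. */
static const char *find_with_length(const char *haystack, size_t h_length,
                                    const char *needle, size_t n_length) {
    assert(haystack != NULL && needle != NULL);
    if (n_length == 0) return haystack;   /* an empty needle matches at the start */
    if (n_length > h_length) return NULL; /* the needle cannot fit */
    for (size_t i = 0; i + n_length <= h_length; ++i)
        if (memcmp(haystack + i, needle, n_length) == 0) return haystack + i;
    return NULL;
}

/* Returns 1 if strstr misses a needle that sits behind an embedded zero byte. */
static int strstr_misses_after_zero_byte(void) {
    static const char text[] = "abc\0needle"; /* 10 payload bytes, zero byte at index 3 */
    const char *by_length = find_with_length(text, sizeof(text) - 1, "needle", 6);
    return strstr(text, "needle") == NULL && by_length == text + 4;
}
```

Here `strstr` only sees "abc", while the length-aware search reaches the match at offset 4. A production library would replace the byte-by-byte loop with SIMD or SWAR kernels, but the length-based interface is the part that matters for strings coming from higher-level languages.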

StringZilla is the GodZilla of string libraries, using SIMD and SWAR to accelerate string operations on modern CPUs. It is up to 10x faster than the default and even other SIMD-accelerated string libraries in C, C++, Python, and other languages, while covering broad functionality. It accelerates exact and fuzzy string matching, edit distance computations, sorting, lazily-evaluated ranges to avoid memory allocations, and even random-string generators.

🐂 C: Upgrade LibC's <string.h> to <stringzilla.h> in C99
🐉 C++: Upgrade STL's <string> to <stringzilla.hpp> in C++11
🐍 Python: Upgrade your str to faster Str
🍎 Swift: Use the String+StringZilla extension
🦀 Rust: Use the StringZilla traits crate
🐚 Shell: Accelerate common CLI tools with sz_ prefix
📚 Researcher? Jump to Algorithms & Design Decisions
💡 Thinking of contributing? Look for "good first issues"
🤝 And check the guide to set up the environment
Want more bindings or features? Let me know!

Who is this for?

For data engineers parsing large datasets, like the CommonCrawl, RedPajama, or LAION.
For software engineers optimizing strings in their apps and services.
For bioinformaticians and search engineers looking for edit-distances for USearch.
For DBMS devs, optimizing LIKE, ORDER BY, and GROUP BY operations.
For hardware designers needing a SWAR baseline for string-processing functionality.
For students studying SIMD/SWAR applications to non-data-parallel operations.

Performance

<table style="width: 100%; text-align: center; table-layout: fixed;">
  <colgroup>
    <col style="width: 25%;">
    <col style="width: 25%;">
    <col style="width: 25%;">
    <col style="width: 25%;">
  </colgroup>
  <tr>
    <th align="center">C</th>
    <th align="center">C++</th>
    <th align="center">Python</th>
    <th align="center">StringZilla</th>
  </tr>
  <tr>
    <td colspan="4" align="center">find the first occurrence of a random word from text, ≅ 5 bytes long</td>
  </tr>
  <tr>
    <td align="center"><code>strstr</code> <sup>1</sup><br/>x86: 7.4 · arm: 2.0 GB/s</td>
    <td align="center"><code>.find</code><br/>x86: 2.9 · arm: 1.6 GB/s</td>
    <td align="center"><code>.find</code><br/>x86: 1.1 · arm: 0.6 GB/s</td>
    <td align="center"><code>sz_find</code><br/>x86: 10.6 · arm: 7.1 GB/s</td>
  </tr>
  <tr>
    <td colspan="4" align="center">find the last occurrence of a random word from text, ≅ 5 bytes long</td>
  </tr>
  <tr>
    <td align="center">⚪</td>
    <td align="center"><code>.rfind</code><br/>x86: 0.5 · arm: 0.4 GB/s</td>
    <td align="center"><code>.rfind</code><br/>x86: 0.9 · arm: 0.5 GB/s</td>
    <td align="center"><code>sz_rfind</code><br/>x86: 10.8 · arm: 6.7 GB/s</td>
  </tr>
  <tr>
    <td colspan="4" align="center">split lines separated by <code>\n</code> or <code>\r</code> <sup>2</sup></td>
  </tr>
  <tr>
    <td align="center"><code>strcspn</code> <sup>1</sup><br/>x86: 5.42 · arm: 2.19 GB/s</td>
    <td align="center"><code>.find_first_of</code><br/>x86: 0.59 · arm: 0.46 GB/s</td>
    <td align="center"><code>re.finditer</code><br/>x86: 0.06 · arm: 0.02 GB/s</td>
    <td align="center"><code>sz_find_charset</code><br/>x86: 4.08 · arm: 3.22 GB/s</td>
  </tr>
  <tr>
    <td colspan="4" align="center">find the last occurrence of any of 6 whitespaces <sup>2</sup></td>
  </tr>
  <tr>
    <td align="center">⚪</td>
    <td align="center"><code>.find_last_of</code><br/>x86: 0.25 · arm: 0.25 GB/s</td>
    <td align="center">⚪</td>
    <td align="center"><code>sz_rfind_charset</code><br/>x86: 0.43 · arm: 0.23 GB/s</td>
  </tr>
  <tr>
    <td colspan="4" align="center">Random string from a given alphabet, 20 bytes long <sup>5</sup></td>
  </tr>
  <tr>
    <td align="center"><code>rand() % n</code><br/>x86: 18.0 · arm: 9.4 MB/s</td>
    <td align="center"><code>uniform_int_distribution</code><br/>x86: 47.2 · arm: 20.4 MB/s</td>
    <td align="center"><code>join(random.choices(...))</code><br/>x86: 13.3 · arm: 5.9 MB/s</td>
    <td align="center"><code>sz_generate</code><br/>x86: 56.2 · arm: 25.8 MB/s</td>
  </tr>
  <tr>
    <td colspan="4" align="center">Get sorted order, ≅ 8 million English words <sup>6</sup></td>
  </tr>
  <tr>
    <td align="center"><code>qsort_r</code><br/>x86: 3.55 · arm: 5.77 s</td>
    <td align="center"><code>std::sort</code><br/>x86: 2.79 · arm: 4.02 s</td>
    <td align="center"><code>numpy.argsort</code><br/>x86: 7.58 · arm: 13.00 s</td>
    <td align="center"><code>sz_sort</code><br/>x86: 1.91 · arm: 2.37 s</td>
  </tr>
  <tr>
    <td colspan="4" align="center">Levenshtein edit distance, ≅ 5 bytes long</td>
  </tr>
  <tr>
    <td align="center">⚪</td>
    <td align="center">⚪</td>
    <td align="center">via <code>jellyfish</code> <sup>3</sup><br/>x86: 1,550 · arm: 2,220 ns</td>
    <td align="center"><code>sz_edit_distance</code><br/>x86: 99 · arm: 180 ns</td>
  </tr>
  <tr>
    <td colspan="4" align="center">Needleman-Wunsch alignment scores, ≅ 10 K amino acids long</td>
  </tr>
  <tr>
    <td align="center">⚪</td>
    <td align="center">⚪</td>
    <td align="center">via <code>biopython</code> <sup>4</sup><br/>x86: 257 · arm: 367 ms</td>
    <td align="center"><code>sz_alignment_score</code><br/>x86: 73 · arm: 177 ms</td>
  </tr>
</table>

StringZilla has a lot of functionality, most of which is covered by benchmarks across C, C++, Python and other languages. You can find those in the ./scripts directory, with usage notes listed in the CONTRIBUTING.md file. Notably, if the CPU supports misaligned loads, even the 64-bit SWAR backends are faster than either standard library.
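To make the SWAR idea concrete, here is a minimal sketch of a 64-bit byte search. It is an illustration of the general technique, not StringZilla's actual implementation: `swar_find_byte` is a hypothetical name, and the zero-byte detection is the well-known `(v - 0x01..01) & ~v & 0x80..80` bit trick. The `memcpy` expresses a possibly misaligned 8-byte load, which compilers typically lower to a single instruction on CPUs that support such loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only (not StringZilla's API):
 * finds the first occurrence of `needle` by testing 8 bytes per iteration. */
static const char *swar_find_byte(const char *haystack, size_t length, char needle) {
    assert(haystack != NULL || length == 0);
    const uint64_t ones = 0x0101010101010101ull;
    const uint64_t highs = 0x8080808080808080ull;
    /* Broadcast the needle into every byte of a 64-bit word. */
    const uint64_t pattern = ones * (uint8_t)needle;
    size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        uint64_t word;
        memcpy(&word, haystack + i, 8); /* possibly misaligned 8-byte load */
        /* Bytes equal to the needle become zero after the XOR. */
        uint64_t diff = word ^ pattern;
        /* Non-zero exactly when at least one byte of `diff` is zero. */
        if ((diff - ones) & ~diff & highs) break;
    }
    /* Scalar loop: pinpoints the match inside the flagged word,
     * or scans the tail that is shorter than 8 bytes. */
    for (; i < length; ++i)
        if (haystack[i] == needle) return haystack + i;
    return NULL;
}
```

The word-level test replaces eight compare-and-branch steps with a handful of branch-free arithmetic operations. Locating the match with a scalar loop keeps the sketch independent of byte order; a tuned version would more likely use a count-trailing-zeros or count-leading-zeros instruction instead.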