Introduction

Welcome back, aspiring data compression expert! In the previous chapters, you’ve learned about the fundamental concepts of OpenZL and how to get it set up on your system. You’ve grasped the idea that OpenZL isn’t just another ‘black box’ compressor; it’s a powerful framework designed to build specialized compressors tailored to your data’s unique structure.

This chapter is where we dive into the heart of that specialization: built-in codecs. Think of codecs as the individual tools in OpenZL’s toolkit. By understanding what these tools do and how to apply them, you unlock the true potential of format-aware compression. We’ll explore some of the common built-in codecs, understand their purpose, and see them in action with practical examples. Get ready to select the perfect compression strategy for your structured data!

Core Concepts: OpenZL’s Codec Philosophy

At its core, OpenZL’s strength lies in its modularity. Instead of a single, monolithic algorithm, it provides a library of individual “codecs.” Each codec is designed to handle a specific type of data or a particular compression strategy, making it highly efficient for its intended purpose.

What is a Codec?

In OpenZL, a codec (short for coder-decoder) is a component that knows how to compress and decompress a specific type of data, or to apply a particular transformation. Once you describe your data’s structure with a DataGraph, you assign appropriate codecs to the nodes in that graph, and OpenZL orchestrates those codecs to achieve optimal compression.

Imagine you’re packing a suitcase. You wouldn’t just throw everything in randomly. You’d fold clothes (a “textile codec”), roll socks (another “textile codec” variant), put toiletries in a bag (a “liquid container codec”), and so on. Each item type gets a specialized handling method. OpenZL works similarly, but for data!

The Power of Specialization

Generic compressors like Gzip or Zstd are excellent all-rounders, but they treat all data as a raw stream of bytes. OpenZL, however, understands the semantics of your data. If you have a column of integers, it won’t try to compress them like text; it will use an integer-optimized codec. This specialization leads to significantly better compression ratios and often faster compression/decompression for structured data.

Common Categories of Built-in Codecs

OpenZL offers a rich set of built-in codecs, each optimized for different data types and patterns. While the exact list can evolve, here are some common categories you’ll encounter in the latest stable OpenZL release (as of January 2026):

  • Integer Codecs: Designed for various integer types (e.g., int32_t, int64_t). Often employ techniques like delta encoding, variable-byte encoding, or specialized bit packing.
  • Floating-Point Codecs: Optimized for float and double values, often leveraging techniques that exploit the commonalities in their binary representations or patterns in time-series data.
  • String Codecs: For sequences of characters. May use dictionary encoding, prefix/suffix compression, or other text-specific methods.
  • Run-Length Encoding (RLE) Codecs: Excellent for data with long sequences of identical values.
  • Dictionary Codecs: Identify repeating patterns or values and replace them with shorter codes. Highly effective for categorical data or frequently occurring strings/numbers.
  • Boolean Codecs: Efficiently store true/false values.
  • Raw Codecs: A fallback or pass-through codec for data that doesn’t fit other categories or where no further compression is desired.
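To make one of these techniques concrete, consider delta encoding, a staple of integer codecs: store the difference between consecutive values instead of the values themselves, so slowly changing data turns into a stream of small numbers that later stages can pack tightly. Here is a minimal, OpenZL-independent sketch in plain C++ (this illustrates the general technique, not OpenZL’s internal implementation):

```cpp
#include <cstdint>
#include <vector>

// Delta-encode: keep the first value, then store successive differences.
std::vector<int32_t> delta_encode(const std::vector<int32_t>& values) {
    std::vector<int32_t> deltas;
    deltas.reserve(values.size());
    int32_t prev = 0;
    for (int32_t v : values) {
        deltas.push_back(v - prev);  // small numbers when values change slowly
        prev = v;
    }
    return deltas;
}

// Delta-decode: a running sum restores the original values exactly.
std::vector<int32_t> delta_decode(const std::vector<int32_t>& deltas) {
    std::vector<int32_t> values;
    values.reserve(deltas.size());
    int32_t acc = 0;
    for (int32_t d : deltas) {
        acc += d;
        values.push_back(acc);
    }
    return values;
}
```

Encoding {100, 101, 102, 103} yields {100, 1, 1, 1}; those small deltas can then be bit-packed or variable-byte encoded far more compactly than the original 32-bit values.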

How Codecs Work Together

A DataGraph describes the relationships between different parts of your data. Each node in this graph represents a piece of data (e.g., a column in a table, a field in a struct). You assign a specific codec to each node. When you initiate compression, OpenZL traverses the DataGraph, applies the assigned codecs to their respective data segments, and combines the compressed outputs.

Here’s a simplified view of how a DataGraph might use different codecs:

flowchart TD
    A["Input Data Stream"] --> B{"Parse Data Structure"}
    B -->|"Extract Column 1 (Ints)"| C["Integer Codec"]
    B -->|"Extract Column 2 (Strings)"| D["String Dictionary Codec"]
    B -->|"Extract Column 3 (Booleans)"| E["Boolean Codec"]
    C -->|"Compressed Ints"| F["Combined Compressed Stream"]
    D -->|"Compressed Strings"| F
    E -->|"Compressed Booleans"| F
    F --> G["Output Compressed Data"]

Figure 7.1: Conceptual flow of data through a DataGraph with multiple codecs.

Step-by-Step Implementation: Compressing Integers with a Built-in Codec

Let’s get our hands dirty! We’ll start with a simple example: compressing a vector of integers. We’ll define a basic DataGraph and assign an integer-specific codec to it.

First, ensure you have OpenZL installed and configured as per Chapter 2. We’ll create a C++ program.

Step 1: Set up your project

Create a new C++ file, say codec_example.cpp.

// codec_example.cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Include OpenZL headers
#include <openzl/openzl.h>
#include <openzl/codecs/integer_codec.h> // For integer compression
#include <openzl/data_graph.h>           // For defining data structure

int main() {
    std::cout << "OpenZL Built-in Codec Example: Integer Compression" << std::endl;

    // Our sample data: a vector of integers
    std::vector<int32_t> data = {100, 101, 102, 103, 100, 105, 106, 107, 100, 109};
    std::cout << "Original data size: " << data.size() * sizeof(int32_t) << " bytes" << std::endl;

    // In a real application, you'd define a more complex DataGraph
    // For this simple example, we'll create a single node representing our vector.
    openzl::DataGraph graph;
    
    // Add a node to our graph. Let's call it "my_integers".
    // This node represents the stream of int32_t data.
    auto& node = graph.add_node("my_integers");

    // CRITICAL: Assign the IntegerCodec to this node.
    // We specify the type of integer it handles (int32_t).
    node.set_codec<openzl::codecs::IntegerCodec<int32_t>>();

    // Now, let's prepare the data for compression.
    // OpenZL needs a map of node names to raw data buffers.
    std::map<std::string, openzl::Buffer> input_data_map;
    input_data_map["my_integers"] = openzl::Buffer(data.data(), data.size() * sizeof(int32_t));

    // Create a compressor instance
    openzl::Compressor compressor(graph);

    // Perform compression!
    openzl::Buffer compressed_buffer = compressor.compress(input_data_map);

    std::cout << "Compressed data size: " << compressed_buffer.size() << " bytes" << std::endl;
    std::cout << "Compression ratio: " << (double)data.size() * sizeof(int32_t) / compressed_buffer.size() << ":1" << std::endl;

    // --- Now, let's decompress it to verify ---
    openzl::Decompressor decompressor(graph);

    // Decompression requires a map of node names to output buffers.
    // We need to allocate memory for the decompressed data.
    std::vector<int32_t> decompressed_data(data.size());
    std::map<std::string, openzl::Buffer> output_data_map;
    output_data_map["my_integers"] = openzl::Buffer(decompressed_data.data(), decompressed_data.size() * sizeof(int32_t));

    decompressor.decompress(compressed_buffer, output_data_map);

    // Verify the data
    bool success = true;
    for (size_t i = 0; i < data.size(); ++i) {
        if (data[i] != decompressed_data[i]) {
            std::cerr << "Mismatch at index " << i << ": original=" << data[i] << ", decompressed=" << decompressed_data[i] << std::endl;
            success = false;
            break;
        }
    }

    if (success) {
        std::cout << "Decompression successful! Data matches original." << std::endl;
    } else {
        std::cerr << "Decompression FAILED!" << std::endl;
    }

    return 0;
}

Step 2: Compile and Run

To compile this, you’ll typically use cmake and g++ (or clang++). Assuming your project directory contains a CMakeLists.txt that locates and links OpenZL, and that OpenZL itself was built in a build directory, you might compile like this:

# Assuming you are in your project directory, sibling to OpenZL's build directory
mkdir build_example
cd build_example
cmake .. -DCMAKE_PREFIX_PATH=/path/to/openzl/build # Adjust /path/to/openzl/build
make
./codec_example
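The invocation above assumes a CMakeLists.txt in your project directory. A minimal sketch might look like the following; note that the package name openzl and the imported target openzl::openzl are assumptions here, so check how your OpenZL build actually exports its targets:

```cmake
cmake_minimum_required(VERSION 3.16)
project(codec_example CXX)

set(CMAKE_CXX_STANDARD 17)

# find_package relies on CMAKE_PREFIX_PATH pointing at the OpenZL build/install tree.
find_package(openzl REQUIRED)

add_executable(codec_example codec_example.cpp)
target_link_libraries(codec_example PRIVATE openzl::openzl)
```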

Explanation of the Code:

  1. Includes: We include <openzl/openzl.h> for general OpenZL functionality, <openzl/codecs/integer_codec.h> specifically for the integer codec, and <openzl/data_graph.h> to define our data structure.
  2. Sample Data: A std::vector<int32_t> holds our data. We print its original size.
  3. openzl::DataGraph graph;: This line creates an empty data graph. In more complex scenarios, this graph would describe the intricate relationships within your structured data.
  4. auto& node = graph.add_node("my_integers");: We add a single node to our graph and give it a name, “my_integers”. This node represents the stream of integer data we want to compress.
  5. node.set_codec<openzl::codecs::IntegerCodec<int32_t>>();: This is the crucial step! We tell OpenZL that the data represented by the “my_integers” node should be handled by the IntegerCodec, specifically for int32_t types. OpenZL’s IntegerCodec is designed to efficiently compress integer sequences.
  6. input_data_map: OpenZL expects data as a map from node names to openzl::Buffer objects. An openzl::Buffer is essentially a pointer to raw memory and its size. We create one for our integer vector.
  7. openzl::Compressor compressor(graph);: We instantiate a Compressor object, providing it with our defined DataGraph. The compressor now knows how to compress data that conforms to this graph.
  8. compressor.compress(input_data_map);: We call the compress method, passing our input data. OpenZL uses the graph and the assigned codecs to compress the data.
  9. Decompression: The decompression process mirrors compression. We create a Decompressor, allocate an output buffer of the original expected size, and call decompress. Finally, we verify the decompressed data against the original.

Mini-Challenge: Compressing Repetitive Strings

Now it’s your turn to apply what you’ve learned!

Challenge: Modify the codec_example.cpp program to compress a std::vector<std::string> that contains some repetitive strings. Use an appropriate built-in OpenZL codec for strings.

Hint:

  • You’ll need to include a different codec header, likely something like <openzl/codecs/string_codec.h> or <openzl/codecs/dictionary_codec.h>. The StringCodec is a good starting point, but if you have many repeated strings, a DictionaryCodec might yield better results. OpenZL’s StringCodec often has dictionary-like capabilities built-in for common patterns.
  • The openzl::Buffer for strings needs careful handling, because strings are variable-length. Typically, you’d flatten the strings into a single contiguous buffer (e.g., by concatenating them with null terminators or length prefixes) and provide the offsets/lengths as metadata. Alternatively, a codec may accept std::vector<std::string> directly (check the OpenZL docs for StringCodec constructors). A more robust approach uses a DataGraph with one node for the string bytes and another for the offsets/lengths. For this challenge, just try to get some string data compressed.
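To make the flattening step concrete, here is a small OpenZL-independent sketch in plain C++. The layout (a 4-byte native-endian length prefix before each string’s bytes) is just one of the illustrative conventions mentioned above, not a required OpenZL format:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Flatten a vector of strings into one contiguous buffer:
// each string is stored as a 4-byte (native-endian) length followed by its bytes.
std::vector<uint8_t> flatten(const std::vector<std::string>& strings) {
    std::vector<uint8_t> buf;
    for (const auto& s : strings) {
        uint32_t len = static_cast<uint32_t>(s.size());
        const uint8_t* p = reinterpret_cast<const uint8_t*>(&len);
        buf.insert(buf.end(), p, p + sizeof(len));   // length prefix
        buf.insert(buf.end(), s.begin(), s.end());   // string bytes
    }
    return buf;
}

// Reverse the flattening, recovering the original vector of strings.
std::vector<std::string> unflatten(const std::vector<uint8_t>& buf) {
    std::vector<std::string> strings;
    size_t pos = 0;
    while (pos + sizeof(uint32_t) <= buf.size()) {
        uint32_t len;
        std::memcpy(&len, buf.data() + pos, sizeof(len));
        pos += sizeof(len);
        strings.emplace_back(reinterpret_cast<const char*>(buf.data() + pos), len);
        pos += len;
    }
    return strings;
}
```

A flat buffer like this can then be handed to a string- or dictionary-style codec as a single node’s data, with the lengths available as metadata for a second node if the codec requires it.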

What to observe/learn:

  • How does the compression ratio change with different string patterns (e.g., highly repetitive vs. unique strings)?
  • How does assigning the correct codec impact efficiency?

Common Pitfalls & Troubleshooting

  1. Incorrect Codec Choice: The most common mistake is using a codec that doesn’t match your data type or its patterns. For example, trying to compress floating-point numbers with an IntegerCodec will lead to incorrect results or errors. Always consult the OpenZL documentation for the recommended codec for your data type and expected patterns.
  2. Misrepresenting Data Structure: If your DataGraph doesn’t accurately reflect your data’s layout, OpenZL won’t be able to apply codecs correctly. For instance, if you have a struct with an int and a float, but your DataGraph only defines an int node, the float data will be misinterpreted.
  3. Buffer Mismatches (Size/Type): Ensure that the openzl::Buffer you provide for compression and decompression matches the exact size and type expected by the DataGraph nodes and their assigned codecs. A common error is providing a Buffer that’s too small for decompression, leading to crashes or corrupted data.
  4. Missing Headers: For each specific codec you use (e.g., IntegerCodec, StringCodec), make sure you include its corresponding header file.

Summary

Phew! You’ve just taken a significant step in mastering OpenZL. Here’s a quick recap of what we covered:

  • Codecs as Building Blocks: OpenZL leverages specialized codecs as modular components for compression.
  • Format-Aware Advantage: By assigning the right codec to the right data type within a DataGraph, OpenZL achieves superior compression ratios compared to generic compressors.
  • Variety of Codecs: OpenZL provides built-in codecs for various data types like integers, floats, strings, and for common patterns like run-length encoding.
  • Practical Application: You implemented a basic C++ program to compress and decompress integer data using OpenZL’s IntegerCodec, observing the process firsthand.
  • Key Steps: Defining a DataGraph, adding nodes, assigning specific codecs using set_codec<>, and then using Compressor and Decompressor objects.
  • Troubleshooting: We identified common issues like incorrect codec selection, DataGraph inaccuracies, and buffer mismatches.

You’re now equipped with the knowledge to start choosing the right tools from OpenZL’s codec arsenal. In the next chapter, we’ll explore more advanced topics, such as combining multiple codecs within a single DataGraph and even delving into how to build custom codecs for truly unique data formats. Keep experimenting and happy compressing!
