Chapter 4: Your First Custom Compressor: A "Hello World" Example

Welcome back, aspiring data magician! In the previous chapters, we laid the groundwork by exploring what OpenZL is, why it’s a game-changer for structured data compression, and how to get your development environment ready. You’re now equipped with the tools and the foundational knowledge.

In this exciting chapter, we’re going to roll up our sleeves and build our very first custom compressor using OpenZL. Think of this as your “Hello World” moment for format-aware compression. We’ll define a simple data structure, translate it into an OpenZL schema, and then use OpenZL to generate a specialized compressor that can efficiently handle data matching our structure. By the end, you’ll have compressed and decompressed your own custom data, gaining invaluable hands-on experience and a deeper appreciation for OpenZL’s power.

This chapter will connect the abstract concepts we discussed earlier with concrete code, showing you exactly how OpenZL’s schema-driven approach works in practice. Get ready to see your custom compressor come to life!

Core Concepts: The Schema and the Codecs

Before we dive into writing code, let’s briefly revisit the core ideas that make OpenZL so powerful, specifically how they relate to building a custom compressor.

The Data Schema: Your Data’s Blueprint

At the heart of OpenZL’s format-aware compression is the Schema. Imagine your data isn’t just a blob of bytes, but a structured entity – like a database table, a JSON object, or even a simple log entry. This structure has fields, types, and relationships.

An OpenZL schema is essentially a blueprint of your data. You describe the exact format of your input data to OpenZL. This description isn’t about how to compress, but what the data looks like. Why is this important? Because once OpenZL understands your data’s structure, it can devise a highly optimized compression strategy tailored specifically for it. This is the “format-aware” part!

Think of it this way: If you want to build a custom box for a specific toy, you first need to know the toy’s shape, size, and material. The schema is that detailed description of the toy.
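To make the blueprint idea concrete, here is a minimal, library-independent sketch of what a schema boils down to: an ordered list of field names and types. The names FieldType, FieldDesc, typeName, and describe are hypothetical and invented for this illustration; they are not part of OpenZL.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration only; none of these names come from OpenZL.
enum class FieldType { Int32, Int64, String };

// One entry of the blueprint: what the field is called and what it holds.
struct FieldDesc {
    std::string name;
    FieldType type;
};

// Returns a readable name for each supported type.
inline const char* typeName(FieldType t) {
    switch (t) {
        case FieldType::Int32:  return "int32";
        case FieldType::Int64:  return "int64";
        case FieldType::String: return "string";
    }
    return "unknown";
}

// Renders a schema as "name:type, name:type", preserving field order.
inline std::string describe(const std::vector<FieldDesc>& schema) {
    std::string out;
    for (const auto& field : schema) {
        if (!out.empty()) out += ", ";
        out += field.name + ":" + typeName(field.type);
    }
    return out;
}
```

For a message with a string tag and a 32-bit value, describe returns "tag:string, value:int32". Notice that the description says nothing about how to compress; it only records what the data looks like.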

Codecs: The Compression Building Blocks

Once OpenZL has your data’s blueprint (the schema), it doesn’t invent compression algorithms from scratch. Instead, it uses a library of existing, highly optimized compression primitives called Codecs. These codecs are like specialized tools in a toolbox: there are codecs for integers, strings, floating-point numbers, booleans, and more. Some codecs are general-purpose (like dictionary compression), while others are highly specific (like delta encoding for sequential numbers).
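As a taste of what a specialized codec does, here is a small standalone sketch of delta encoding. It illustrates the general technique only and makes no claim about how OpenZL implements its own codecs; deltaEncode and deltaDecode are names chosen for this example.

```cpp
#include <cstdint>
#include <vector>

// Delta encoding: the first output is the first value itself (its difference
// from an implicit 0), and every later output is the difference from the
// previous value. Differences are computed on unsigned integers to avoid
// signed overflow.
inline std::vector<std::int32_t> deltaEncode(const std::vector<std::int32_t>& values) {
    std::vector<std::int32_t> deltas;
    deltas.reserve(values.size());
    std::uint32_t previous = 0;
    for (std::int32_t v : values) {
        const auto current = static_cast<std::uint32_t>(v);
        deltas.push_back(static_cast<std::int32_t>(current - previous));
        previous = current;
    }
    return deltas;
}

// Reverses deltaEncode by accumulating the differences.
inline std::vector<std::int32_t> deltaDecode(const std::vector<std::int32_t>& deltas) {
    std::vector<std::int32_t> values;
    values.reserve(deltas.size());
    std::uint32_t previous = 0;
    for (std::int32_t d : deltas) {
        previous += static_cast<std::uint32_t>(d);
        values.push_back(static_cast<std::int32_t>(previous));
    }
    return values;
}
```

Encoding {100, 101, 102, 104} gives {100, 1, 1, 2}. The small, repetitive differences are much easier for a later entropy-coding stage to shrink than the original values, which is why delta encoding suits sequential numbers.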

OpenZL intelligently selects and arranges these codecs based on your schema to create a “compression plan” or a “compression graph.” This graph represents the flow of your data through a series of codecs, each designed to compress a specific part of your structured data most effectively.
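The following sketch shows, under simplifying assumptions, what a two-stage graph for a tag/value record might look like: first split the records into one column per field, then dictionary-encode the string column. Message, Columns, and splitAndEncode are hypothetical names for this illustration and do not reflect OpenZL's internals.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Message {
    std::string tag;
    std::int32_t value;
};

// One output stream per field, each in a form that suits its type.
struct Columns {
    std::vector<std::string> dictionary;    // each distinct tag, in first-seen order
    std::vector<std::uint32_t> tagIndices;  // per message: index into dictionary
    std::vector<std::int32_t> values;       // per message: the integer field
};

// Stage 1 splits records into per-field columns. Stage 2 dictionary-encodes
// the tag column so that each repeated string is stored only once.
inline Columns splitAndEncode(const std::vector<Message>& messages) {
    Columns out;
    std::unordered_map<std::string, std::uint32_t> seen;
    for (const auto& msg : messages) {
        auto it = seen.find(msg.tag);
        if (it == seen.end()) {
            const auto index = static_cast<std::uint32_t>(out.dictionary.size());
            it = seen.emplace(msg.tag, index).first;
            out.dictionary.push_back(msg.tag);
        }
        out.tagIndices.push_back(it->second);
        out.values.push_back(msg.value);
    }
    return out;
}
```

For five messages tagged status, event, status, event, status, the dictionary holds just "status" and "event", and the tag column becomes the indices 0, 1, 0, 1, 0. Each column could then flow into a further codec chosen for its type.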

The overall flow can be summarized as follows:

Data schema + codec library -> compression graph -> specialized compressor

Your data’s description (the schema) is the key that unlocks the creation of an efficient, specialized compressor from OpenZL’s toolkit of codecs.

Step-by-Step Implementation: Building Our “Hello World” Compressor

Let’s get practical! We’ll define a super simple data structure: a message containing a tag (string) and a value (integer). Then, we’ll build an OpenZL compressor for it.

Prerequisites: Ensure you have OpenZL built and installed, as covered in Chapter 3. We’ll assume you’re working within a C++ project that can link against the OpenZL library.

Step 1: Define Our Simple Data Structure

First, let’s represent our “Hello World” data in C++. Create a new C++ file (e.g., hello_compressor.cpp) and add this structure:

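// hello_compressor.cpp
#include <cstdint>   // std::int32_t
#include <iostream>
#include <memory>    // std::unique_ptr, used in main()
#include <string>
#include <vector>

// Our simple data structure
struct MyMessage {
    std::string tag;
    std::int32_t value;  // fixed 32-bit width, matching the Int32 schema type
};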

This is the raw, uncompressed form of our data. Our goal is to compress a collection of MyMessage objects.

Step 2: Define the OpenZL Schema

Now, for the magic! We need to tell OpenZL about MyMessage. OpenZL provides a way to define schemas using C++ code. This involves using OpenZL’s type system to map our MyMessage fields.

Let’s add the OpenZL headers and begin defining our schema. We’ll include <openzl/schema/schema.h> and <openzl/schema/types.h> for the schema itself, plus the compressor and buffer headers that Step 3 needs.

// hello_compressor.cpp (continued)
#include <openzl/schema/schema.h>
#include <openzl/schema/types.h>
#include <openzl/compressor/compressor.h> // For the compressor itself
#include <openzl/io/buffer.h>             // For input/output buffers

// ... (MyMessage struct definition from above) ...

// Function to define our OpenZL schema
OpenZL::Schema defineMyMessageSchema() {
    // Create a new schema named "MyMessageSchema"
    OpenZL::Schema schema("MyMessageSchema");

    // Define the 'tag' field as a string
    // OpenZL::Type::String() represents a string type
    schema.addField("tag", OpenZL::Type::String());

    // Define the 'value' field as an integer
    // OpenZL::Type::Int32() represents a 32-bit integer type
    schema.addField("value", OpenZL::Type::Int32());

    // For compressing a collection of these messages,
    // we define the root of our schema as an array of MyMessage.
    // OpenZL::Type::Array(schema) indicates an array where each element
    // conforms to the 'schema' we just defined for MyMessage.
    schema.setRootType(OpenZL::Type::Array(schema));

    return schema;
}

Explanation:

OpenZL::Schema schema("MyMessageSchema");: We instantiate a Schema object, giving it a descriptive name.
schema.addField("tag", OpenZL::Type::String());: We declare a field named “tag” and specify its type using OpenZL::Type::String(). OpenZL has various primitive types, such as Int32, Int64, Float, Double, Bool, and String.
schema.addField("value", OpenZL::Type::Int32());: Similarly, we define the “value” field as a 32-bit integer.
schema.setRootType(OpenZL::Type::Array(schema));: This is crucial. We’re not just compressing one MyMessage, but an array (or vector in C++) of them. So, we tell OpenZL that the top-level structure it will receive for compression is an Array where each element itself conforms to our MyMessageSchema.

Step 3: Build and Use the Custom Compressor

With our schema defined, we can now ask OpenZL to build a compressor. The process has seven steps, numbered to match the comments in the code below:

1. Instantiate the schema.
2. Create a compressor builder.
3. Build the compressor.
4. Prepare data for compression.
5. Compress the data.
6. Decompress the data.
7. Verify the results.

Let’s add the main function to hello_compressor.cpp:

// hello_compressor.cpp (continued)
// ... (includes and MyMessage struct, defineMyMessageSchema function from above) ...

int main() {
    std::cout << "--- OpenZL Hello World Compressor ---" << std::endl;

    // 1. Define and get our schema
    OpenZL::Schema mySchema = defineMyMessageSchema();
    std::cout << "Schema 'MyMessageSchema' defined successfully." << std::endl;

    // 2. Create a compressor builder
    // The CompressorBuilder takes the schema and optional configuration.
    // For now, we'll stick to default configurations.
    OpenZL::CompressorBuilder builder(mySchema);

    // 3. Build the compressor
    // This is where OpenZL analyzes the schema and generates the optimal
    // compression graph and underlying codecs.
    std::unique_ptr<OpenZL::Compressor> compressor = builder.build();
    if (!compressor) {
        std::cerr << "Failed to build OpenZL compressor!" << std::endl;
        return 1;
    }
    std::cout << "Custom compressor built successfully." << std::endl;

    // 4. Prepare some sample data
    std::vector<MyMessage> originalData = {
        {"status", 200},
        {"event", 101},
        {"status", 404},
        {"event", 102},
        {"status", 200}
    };
    std::cout << "Original data size (approx): " << (originalData.size() * (sizeof(int) + 10)) << " bytes (rough estimate for strings)." << std::endl;

    // OpenZL expects data to be written into an OpenZL::Writer.
    // We'll use a MemoryWriter for simplicity to simulate writing to a buffer.
    OpenZL::MemoryWriter inputWriter;
    for (const auto& msg : originalData) {
        inputWriter.writeString(msg.tag);
        inputWriter.writeInt32(msg.value);
    }
    // Get the raw buffer from the writer, which OpenZL will process.
    OpenZL::Buffer inputBuffer = inputWriter.toBuffer();
    std::cout << "Input buffer prepared with " << originalData.size() << " messages." << std::endl;

    // 5. Compress the data
    OpenZL::MemoryWriter compressedWriter;
    compressor->compress(inputBuffer, compressedWriter);
    OpenZL::Buffer compressedBuffer = compressedWriter.toBuffer();
    std::cout << "Data compressed! Compressed size: " << compressedBuffer.size() << " bytes." << std::endl;

    // 6. Decompress the data
    OpenZL::MemoryWriter decompressedWriter;
    compressor->decompress(compressedBuffer, decompressedWriter);
    OpenZL::Buffer decompressedBuffer = decompressedWriter.toBuffer();
    std::cout << "Data decompressed! Decompressed size: " << decompressedBuffer.size() << " bytes." << std::endl;

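    // 7. Verify the results
    // A size mismatch means the round trip already failed, so stop before
    // reading past the end of the decompressed buffer.
    if (decompressedBuffer.size() != inputBuffer.size()) {
        std::cerr << "Size mismatch after decompression!" << std::endl;
        return 1;
    }
    OpenZL::MemoryReader reader(decompressedBuffer);
    std::vector<MyMessage> decompressedData;
    decompressedData.reserve(originalData.size());
    for (size_t i = 0; i < originalData.size(); ++i) {
        MyMessage msg;
        msg.tag = reader.readString();
        msg.value = reader.readInt32();
        decompressedData.push_back(msg);
    }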

    bool success = true;
    for (size_t i = 0; i < originalData.size(); ++i) {
        if (originalData[i].tag != decompressedData[i].tag ||
            originalData[i].value != decompressedData[i].value) {
            std::cerr << "Verification failed at index " << i << "!" << std::endl;
            std::cerr << "Original: {" << originalData[i].tag << ", " << originalData[i].value << "}" << std::endl;
            std::cerr << "Decompressed: {" << decompressedData[i].tag << ", " << decompressedData[i].value << "}" << std::endl;
            success = false;
            break;
        }
    }

    if (success) {
        std::cout << "Verification successful! Original and decompressed data match." << std::endl;
        std::cout << "Congratulations on your first custom OpenZL compressor!" << std::endl;
    } else {
        std::cerr << "Something went wrong during compression/decompression." << std::endl;
        return 1;
    }

    return 0;
}

Explanation of the main function:

Schema Instantiation: We call defineMyMessageSchema() to get our schema object.
CompressorBuilder: This class takes your schema and prepares to build the actual compressor. It’s the entry point to OpenZL’s optimization engine.
builder.build(): This is where OpenZL performs its magic! It analyzes your schema, selects the best codecs, and constructs the internal compression graph. It returns a std::unique_ptr<OpenZL::Compressor>.
OpenZL::MemoryWriter and OpenZL::MemoryReader: These are utility classes provided by OpenZL for writing data into memory buffers and reading from them, respectively. They handle the serialization of your structured data into a byte stream that OpenZL can process. We manually write each field (writeString, writeInt32) in the order defined by our schema.
compressor->compress(inputBuffer, compressedWriter): This is the actual compression call. It takes the input buffer (our raw data) and writes the compressed output to compressedWriter.
compressor->decompress(compressedBuffer, decompressedWriter): The decompression call, taking the compressed data and writing the original data back to decompressedWriter.
Verification: We read the data back from the decompressedBuffer using MemoryReader and compare it element by element with our originalData to ensure lossless compression.
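The element-by-element comparison in the verification step can also be packaged as a small reusable helper. The sketch below is independent of OpenZL; Message mirrors the chapter's MyMessage, and firstMismatch is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Message {
    std::string tag;
    std::int32_t value;
};

// Two messages are equal when every field matches.
inline bool operator==(const Message& a, const Message& b) {
    return a.tag == b.tag && a.value == b.value;
}

// Returns the index of the first differing element, or -1 when both
// sequences have the same length and identical contents. If one sequence
// is a prefix of the other, the mismatch index is the shorter length.
inline std::ptrdiff_t firstMismatch(const std::vector<Message>& expected,
                                    const std::vector<Message>& actual) {
    const std::size_t common = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (!(expected[i] == actual[i])) return static_cast<std::ptrdiff_t>(i);
    }
    if (expected.size() != actual.size()) return static_cast<std::ptrdiff_t>(common);
    return -1;
}
```

A result of -1 means the round trip was lossless; any other value tells you which message to print for debugging. Unlike a loop bounded by the original size alone, this helper also catches the case where decompression produced too few or too many messages.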

Step 4: Compile and Run (with CMake)

To compile this, you’ll need to link against the OpenZL library. Assuming you followed Chapter 3 for OpenZL setup, here’s a basic CMakeLists.txt for your project:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.14)
project(OpenZLHelloWorld CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Find OpenZL. Assuming it's installed in a standard location
# or that you've configured CMAKE_PREFIX_PATH to point to its install directory.
# This assumes OpenZL provides a find_package configuration.
find_package(OpenZL REQUIRED)

add_executable(hello_compressor hello_compressor.cpp)

# Link your executable against the OpenZL library
target_link_libraries(hello_compressor PRIVATE OpenZL::OpenZL)

Compilation Steps:

Save the C++ code as hello_compressor.cpp and the CMake code as CMakeLists.txt in the same directory.
Create a build directory: mkdir build && cd build
Configure CMake: cmake ..
- Troubleshooting Hint: If find_package(OpenZL REQUIRED) fails, ensure OpenZL is correctly installed and its installation path is known to CMake (e.g., by setting CMAKE_PREFIX_PATH before running cmake).
Build: cmake --build .
Run: ./hello_compressor

You should see output similar to this:

--- OpenZL Hello World Compressor ---
Schema 'MyMessageSchema' defined successfully.
Custom compressor built successfully.
Original data size (approx): 70 bytes (rough estimate for strings).
Input buffer prepared with 5 messages.
Data compressed! Compressed size: XX bytes.
Data decompressed! Decompressed size: YY bytes.
Verification successful! Original and decompressed data match.
Congratulations on your first custom OpenZL compressor!

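With only five short messages, the compressed size XX may not be much smaller than the input and can even be larger, because compressed output typically carries some fixed overhead; the savings show up with more data and repetitive patterns. YY should equal the size of the serialized input buffer, since the compression is lossless.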
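The rough estimate printed by the program assumes ten bytes per tag. If you want a more precise baseline, the raw payload (tag characters plus four bytes per value, ignoring any framing a serializer adds) can be computed directly. This is a standalone sketch; payloadBytes and compressionRatio are hypothetical helper names, and Message mirrors the chapter's MyMessage.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Message {
    std::string tag;
    std::int32_t value;
};

// Sums the bytes of actual content: tag characters plus the 4-byte value.
inline std::size_t payloadBytes(const std::vector<Message>& messages) {
    std::size_t total = 0;
    for (const auto& msg : messages) {
        total += msg.tag.size() + sizeof(msg.value);
    }
    return total;
}

// Ratio above 1.0 means the data shrank; returns 0.0 for an empty output
// to avoid dividing by zero.
inline double compressionRatio(std::size_t originalBytes, std::size_t compressedBytes) {
    if (compressedBytes == 0) return 0.0;
    return static_cast<double>(originalBytes) / static_cast<double>(compressedBytes);
}
```

For the five sample messages, payloadBytes returns 48: three "status" tags (18 bytes), two "event" tags (10 bytes), and five 4-byte values (20 bytes). Comparing that number with the compressed size gives a fairer picture than the 70-byte estimate.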

Mini-Challenge: Extend Your Schema

You’ve built your first compressor! Feeling good? Excellent! Now, let’s make a small tweak to solidify your understanding.

Challenge: Modify the MyMessage struct and its corresponding OpenZL schema to include an additional field: a timestamp in milliseconds since the epoch, stored as a 64-bit integer (std::int64_t from <cstdint>, or long long).

Steps:

Update the MyMessage struct in hello_compressor.cpp.
Modify defineMyMessageSchema() to add the new timestamp field with its appropriate OpenZL type (OpenZL::Type::Int64()).
Adjust the originalData vector in main() to include dummy timestamps for each message.
Update the inputWriter and reader logic in main() to write and read the new timestamp field.
Recompile and run!

Hint: Remember to add schema.addField("timestamp", OpenZL::Type::Int64()); and ensure you write/read msg.timestamp in the correct order.

What to Observe/Learn:

How easy it is to extend your data structure and update the schema.
The impact on the compressed size: it should grow, since every message now carries eight more bytes of input, but timestamps that increase steadily are the kind of data that delta encoding handles well.
The importance of keeping the C++ struct, OpenZL schema, and serialization/deserialization logic in sync.
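One way to keep field order in sync is to list the fields in exactly one place and let both the writing and the reading code walk that list. The sketch below shows the pattern on its own, without OpenZL; TimedMessage, forEachField, and NameCollector are hypothetical names for this illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// The extended message from the challenge, with a 64-bit timestamp.
struct TimedMessage {
    std::string tag;
    std::int32_t value;
    std::int64_t timestampMs;  // milliseconds since the Unix epoch
};

// The single place that lists the fields and fixes their order.
// Works for both const and non-const messages.
template <typename M, typename Visitor>
void forEachField(M& msg, Visitor&& visit) {
    visit("tag", msg.tag);
    visit("value", msg.value);
    visit("timestamp", msg.timestampMs);
}

// Example visitor: records the field names in the order they are visited.
struct NameCollector {
    std::vector<std::string> names;

    template <typename T>
    void operator()(const char* name, const T&) {
        names.emplace_back(name);
    }
};
```

Running NameCollector over a TimedMessage yields tag, value, timestamp. A writing visitor and a reading visitor built the same way would both inherit that order, so adding a field means touching forEachField once instead of hunting down every loop.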

Common Pitfalls & Troubleshooting

Schema Mismatch during Serialization/Deserialization:
- Problem: You define tag then value in your schema, but accidentally write value then tag to the MemoryWriter, or read them in the wrong order. This leads to corrupted data or crashes.
- Solution: Always ensure the order of addField calls in your schema definition exactly matches the order you write fields to the OpenZL::Writer and read them from the OpenZL::Reader. Consistency is key!
OpenZL Library Not Found (CMake Error):
- Problem: When running cmake .., you get an error like “Could not find package OpenZL”.
- Solution: This typically means CMake doesn’t know where OpenZL is installed.
  - Ensure OpenZL was installed (e.g., make install after building it).
  - If installed to a non-standard path, tell CMake by setting CMAKE_PREFIX_PATH. For example, if OpenZL is installed in /opt/openzl, run cmake -D CMAKE_PREFIX_PATH=/opt/openzl ..
  - Verify that OpenZLConfig.cmake or openzl-config.cmake exists under the install prefix, for example in <prefix>/lib/cmake/OpenZL or a similar directory.
Compiler Errors (Missing Headers/Symbols):
- Problem: Your C++ code fails to compile with errors about undefined types or functions from OpenZL.
- Solution: Double-check your #include directives. This example uses <openzl/schema/schema.h>, <openzl/schema/types.h>, <openzl/compressor/compressor.h>, and <openzl/io/buffer.h>, plus <memory> for std::unique_ptr. Ensure target_link_libraries in CMakeLists.txt links against OpenZL::OpenZL. If a header or type still cannot be found, compare the names used here with the headers shipped in your installed OpenZL version.
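To see why the field-order pitfall above is so damaging, it helps to look at a byte stream directly. The sketch below uses an assumed wire format (a string is a 4-byte little-endian length followed by its bytes; an int32 is 4 little-endian bytes). ByteWriter and ByteReader are hypothetical stand-ins written for this illustration; they are not OpenZL's MemoryWriter and MemoryReader, whose actual format may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical writer for the assumed length-prefixed format.
class ByteWriter {
public:
    void writeInt32(std::int32_t v) { writeU32(static_cast<std::uint32_t>(v)); }
    void writeString(const std::string& s) {
        writeU32(static_cast<std::uint32_t>(s.size()));
        for (char c : s) bytes_.push_back(static_cast<std::uint8_t>(c));
    }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    void writeU32(std::uint32_t v) {
        for (int shift = 0; shift < 32; shift += 8) {
            bytes_.push_back(static_cast<std::uint8_t>(v >> shift));
        }
    }
    std::vector<std::uint8_t> bytes_;
};

// Hypothetical reader; the buffer passed in must outlive the reader.
class ByteReader {
public:
    explicit ByteReader(const std::vector<std::uint8_t>& bytes) : bytes_(bytes) {}
    std::int32_t readInt32() { return static_cast<std::int32_t>(readU32()); }
    std::string readString() {
        const std::uint32_t length = readU32();
        require(length);
        std::string s(reinterpret_cast<const char*>(bytes_.data() + pos_), length);
        pos_ += length;
        return s;
    }

private:
    // Throws instead of reading beyond the end of the buffer.
    void require(std::size_t n) const {
        if (n > bytes_.size() - pos_) throw std::out_of_range("read past end of buffer");
    }
    std::uint32_t readU32() {
        require(4);
        std::uint32_t v = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            v |= static_cast<std::uint32_t>(bytes_[pos_++]) << shift;
        }
        return v;
    }
    const std::vector<std::uint8_t>& bytes_;
    std::size_t pos_ = 0;
};
```

Writing the tag "status" and then the value 200 produces 14 bytes. Reading them back in the same order recovers both fields. Reading in the wrong order goes wrong immediately: readInt32 returns 6 (the tag's length prefix), and the following readString interprets the characters "stat" as a huge length and throws. A reader without bounds checks would read garbage or crash instead.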

Summary

Congratulations! You’ve successfully built and tested your first custom OpenZL compressor. Let’s recap what you accomplished in this chapter:

Understood the role of the OpenZL Schema as the blueprint for your structured data.
Learned how OpenZL Codecs are the building blocks intelligently selected by the framework.
Defined a simple C++ data structure (MyMessage).
Translated that structure into an OpenZL Schema using OpenZL::Schema and addField with appropriate OpenZL::Types.
Used the CompressorBuilder to generate a specialized compressor.
Performed hands-on compression and decompression of your custom data using OpenZL::MemoryWriter and OpenZL::MemoryReader.
Verified the integrity of your compressed and decompressed data.
Tackled a mini-challenge to extend your schema, reinforcing your understanding.
Identified common pitfalls and learned how to troubleshoot them.

You’ve taken a significant step from theory to practice with OpenZL. You now have a foundational understanding of how to describe your data to OpenZL and harness its power to create custom, highly efficient compression solutions.

In the next chapter, we’ll delve deeper into more complex schema definitions, exploring nested structures, optional fields, and how to represent more intricate data formats. Get ready to expand your OpenZL toolkit!
