Welcome back, aspiring data magician! In the previous chapters, we laid the groundwork by exploring what OpenZL is, why it’s a game-changer for structured data compression, and how to get your development environment ready. You’re now equipped with the tools and the foundational knowledge.
In this exciting chapter, we’re going to roll up our sleeves and build our very first custom compressor using OpenZL. Think of this as your “Hello World” moment for format-aware compression. We’ll define a simple data structure, translate it into an OpenZL schema, and then use OpenZL to generate a specialized compressor that can efficiently handle data matching our structure. By the end, you’ll have compressed and decompressed your own custom data, gaining invaluable hands-on experience and a deeper appreciation for OpenZL’s power.
This chapter will connect the abstract concepts we discussed earlier with concrete code, showing you exactly how OpenZL’s schema-driven approach works in practice. Get ready to see your custom compressor come to life!
Core Concepts: The Schema and the Codecs
Before we dive into writing code, let’s briefly revisit the core ideas that make OpenZL so powerful, specifically how they relate to building a custom compressor.
The Data Schema: Your Data’s Blueprint
At the heart of OpenZL’s format-aware compression is the Schema. Imagine your data isn’t just a blob of bytes, but a structured entity – like a database table, a JSON object, or even a simple log entry. This structure has fields, types, and relationships.
An OpenZL schema is essentially a blueprint of your data. You describe the exact format of your input data to OpenZL. This description isn’t about how to compress, but what the data looks like. Why is this important? Because once OpenZL understands your data’s structure, it can devise a highly optimized compression strategy tailored specifically for it. This is the “format-aware” part!
Think of it this way: If you want to build a custom box for a specific toy, you first need to know the toy’s shape, size, and material. The schema is that detailed description of the toy.
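To make the "blueprint" idea concrete, here is a minimal sketch in plain C++ of what describing data (rather than compressing it) can look like. The names FieldKind, FieldDesc, kindName, and describe are hypothetical and exist only for this illustration; they are not part of OpenZL's type system.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical field kinds, for illustration only (not OpenZL's type system).
enum class FieldKind { Int32, Int64, String };

// One entry of the blueprint: a field name and the kind of value it holds.
struct FieldDesc {
    std::string name;
    FieldKind kind;
};

// Human-readable name of a field kind.
std::string kindName(FieldKind kind) {
    switch (kind) {
        case FieldKind::Int32:  return "int32";
        case FieldKind::Int64:  return "int64";
        case FieldKind::String: return "string";
    }
    return "unknown";
}

// Render a record blueprint as "name:type" pairs separated by ", ".
std::string describe(const std::vector<FieldDesc>& record) {
    std::string out;
    for (std::size_t i = 0; i < record.size(); ++i) {
        if (i > 0) out += ", ";
        out += record[i].name + ":" + kindName(record[i].kind);
    }
    return out;
}
```

Note that nothing here says how to compress anything. The description only states what fields exist and what they hold, which is exactly the information a format-aware tool needs before it can choose a strategy.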
Codecs: The Compression Building Blocks
Once OpenZL has your data’s blueprint (the schema), it doesn’t invent compression algorithms from scratch. Instead, it uses a library of existing, highly optimized compression primitives called Codecs. These codecs are like specialized tools in a toolbox: there are codecs for integers, strings, floating-point numbers, booleans, and more. Some codecs are general-purpose (like dictionary compression), while others are highly specific (like delta encoding for sequential numbers).
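As a rough illustration of why a specialized codec such as delta encoding helps with sequential numbers, here is a small standalone sketch. It is not OpenZL's implementation; deltaEncode and deltaDecode are hypothetical helper names, and the code assumes the differences between neighbours fit in a 64-bit signed integer.

```cpp
#include <cstdint>
#include <vector>

// Replace each value by its difference from the previous one.
// The first element is stored relative to 0, so it is kept as-is.
// Assumes neighbouring differences do not overflow int64_t.
std::vector<std::int64_t> deltaEncode(const std::vector<std::int64_t>& values) {
    std::vector<std::int64_t> deltas;
    deltas.reserve(values.size());
    std::int64_t previous = 0;
    for (std::int64_t v : values) {
        deltas.push_back(v - previous);
        previous = v;
    }
    return deltas;
}

// Undo deltaEncode by keeping a running sum.
std::vector<std::int64_t> deltaDecode(const std::vector<std::int64_t>& deltas) {
    std::vector<std::int64_t> values;
    values.reserve(deltas.size());
    std::int64_t running = 0;
    for (std::int64_t d : deltas) {
        running += d;
        values.push_back(running);
    }
    return values;
}
```

Encoding {1000, 1001, 1003, 1006} gives {1000, 1, 2, 3}. The small, similar numbers that remain are much easier for a later entropy-coding stage to shrink than the original large values.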
OpenZL intelligently selects and arranges these codecs based on your schema to create a “compression plan” or a “compression graph.” This graph represents the flow of your data through a series of codecs, each designed to compress a specific part of your structured data most effectively.
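One way to picture such a graph is as a chain of small transformations, each applied to one field of the data. The sketch below is a hand-written stand-in, assuming a two-stage plan: split records into per-field columns, then dictionary-encode the string column. Record, Columns, splitColumns, DictionaryEncoded, and dictionaryEncode are hypothetical names, and the plan OpenZL actually builds may look quite different.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record type used only for this illustration.
struct Record {
    std::string tag;
    std::int32_t value;
};

// Per-field streams: all tags together, all values together.
struct Columns {
    std::vector<std::string> tags;
    std::vector<std::int32_t> values;
};

// Stage 1: separate the fields so each stream holds one kind of data.
Columns splitColumns(const std::vector<Record>& records) {
    Columns cols;
    for (const Record& r : records) {
        cols.tags.push_back(r.tag);
        cols.values.push_back(r.value);
    }
    return cols;
}

struct DictionaryEncoded {
    std::vector<std::string> dictionary;  // unique strings, first-seen order
    std::vector<std::uint32_t> indices;   // one dictionary index per input
};

// Stage 2: store each distinct string once and refer to it by index.
DictionaryEncoded dictionaryEncode(const std::vector<std::string>& input) {
    DictionaryEncoded out;
    std::unordered_map<std::string, std::uint32_t> seen;
    for (const std::string& s : input) {
        auto it = seen.find(s);
        if (it == seen.end()) {
            const auto index = static_cast<std::uint32_t>(out.dictionary.size());
            it = seen.emplace(s, index).first;
            out.dictionary.push_back(s);
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

For tags like "status", "event", "status", "event", "status", the dictionary holds just two strings and the indices are 0, 1, 0, 1, 0. Knowing the structure is what makes this kind of per-field treatment possible.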
To summarize the flow: your data’s description (the schema) goes in, OpenZL selects and arranges codecs from its toolkit into a compression graph, and the result is a specialized compressor for your format. The schema is the key that unlocks that step.
Step-by-Step Implementation: Building Our “Hello World” Compressor
Let’s get practical! We’ll define a super simple data structure: a message containing a tag (string) and a value (integer). Then, we’ll build an OpenZL compressor for it.
Prerequisites: Ensure you have OpenZL built and installed, as covered in Chapter 3. We’ll assume you’re working within a C++ project that can link against the OpenZL library.
Step 1: Define Our Simple Data Structure
First, let’s represent our “Hello World” data in C++. Create a new C++ file (e.g., hello_compressor.cpp) and add this structure:
// hello_compressor.cpp
#include <iostream>
#include <memory>   // std::unique_ptr, used later when building the compressor
#include <string>
#include <vector>

// Our simple data structure
struct MyMessage {
    std::string tag;
    int value;
};
This is the raw, uncompressed form of our data. Our goal is to compress a collection of MyMessage objects.
Step 2: Define the OpenZL Schema
Now, for the magic! We need to tell OpenZL about MyMessage. OpenZL provides a way to define schemas using C++ code. This involves using OpenZL’s type system to map our MyMessage fields.
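Let’s add the OpenZL headers and begin defining our schema. We’ll include <openzl/schema/schema.h> and <openzl/schema/types.h>. The header paths and class names in the listings follow this chapter’s setup; check them against the headers shipped with your installed OpenZL version and adjust them if they differ.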
// hello_compressor.cpp (continued)
#include <openzl/schema/schema.h>
#include <openzl/schema/types.h>
#include <openzl/compressor/compressor.h> // For the compressor itself
#include <openzl/io/buffer.h>             // For input/output buffers

// ... (MyMessage struct definition from above) ...

// Function to define our OpenZL schema
OpenZL::Schema defineMyMessageSchema() {
    // Create a new schema named "MyMessageSchema"
    OpenZL::Schema schema("MyMessageSchema");

    // Define the 'tag' field as a string
    // OpenZL::Type::String() represents a string type
    schema.addField("tag", OpenZL::Type::String());

    // Define the 'value' field as an integer
    // OpenZL::Type::Int32() represents a 32-bit integer type
    schema.addField("value", OpenZL::Type::Int32());

    // For compressing a collection of these messages,
    // we define the root of our schema as an array of MyMessage.
    // OpenZL::Type::Array(schema) indicates an array where each element
    // conforms to the 'schema' we just defined for MyMessage.
    schema.setRootType(OpenZL::Type::Array(schema));

    return schema;
}
Explanation:
- OpenZL::Schema schema("MyMessageSchema");: We instantiate a Schema object, giving it a descriptive name.
- schema.addField("tag", OpenZL::Type::String());: We declare a field named “tag” and specify its type using OpenZL::Type::String(). OpenZL has various primitive types like Int32, Int64, Float, Double, Bool, String, etc.
- schema.addField("value", OpenZL::Type::Int32());: Similarly, we define the “value” field as a 32-bit integer.
- schema.setRootType(OpenZL::Type::Array(schema));: This is crucial. We’re not just compressing one MyMessage, but an array (or vector in C++) of them. So, we tell OpenZL that the top-level structure it will receive for compression is an Array where each element itself conforms to our MyMessageSchema.
Step 3: Build and Use the Custom Compressor
With our schema defined, we can now ask OpenZL to build a compressor. This involves a few steps:
- Instantiate the schema.
- Create a compressor builder.
- Build the compressor.
- Prepare data for compression.
- Compress the data.
- Decompress the data.
- Verify the results.
Let’s add the main function to hello_compressor.cpp:
// hello_compressor.cpp (continued)
// ... (includes and MyMessage struct, defineMyMessageSchema function from above) ...
int main() {
    std::cout << "--- OpenZL Hello World Compressor ---" << std::endl;

    // 1. Define and get our schema
    OpenZL::Schema mySchema = defineMyMessageSchema();
    std::cout << "Schema 'MyMessageSchema' defined successfully." << std::endl;

    // 2. Create a compressor builder
    // The CompressorBuilder takes the schema and optional configuration.
    // For now, we'll stick to default configurations.
    OpenZL::CompressorBuilder builder(mySchema);

    // 3. Build the compressor
    // This is where OpenZL analyzes the schema and generates the optimal
    // compression graph and underlying codecs.
    std::unique_ptr<OpenZL::Compressor> compressor = builder.build();
    if (!compressor) {
        std::cerr << "Failed to build OpenZL compressor!" << std::endl;
        return 1;
    }
    std::cout << "Custom compressor built successfully." << std::endl;

    // 4. Prepare some sample data
    std::vector<MyMessage> originalData = {
        {"status", 200},
        {"event", 101},
        {"status", 404},
        {"event", 102},
        {"status", 200}
    };
    std::cout << "Original data size (approx): " << (originalData.size() * (sizeof(int) + 10)) << " bytes (rough estimate for strings)." << std::endl;

    // OpenZL expects data to be written into an OpenZL::Writer.
    // We'll use a MemoryWriter for simplicity to simulate writing to a buffer.
    OpenZL::MemoryWriter inputWriter;
    for (const auto& msg : originalData) {
        inputWriter.writeString(msg.tag);
        inputWriter.writeInt32(msg.value);
    }
    // Get the raw buffer from the writer, which OpenZL will process.
    OpenZL::Buffer inputBuffer = inputWriter.toBuffer();
    std::cout << "Input buffer prepared with " << originalData.size() << " messages." << std::endl;

    // 5. Compress the data
    OpenZL::MemoryWriter compressedWriter;
    compressor->compress(inputBuffer, compressedWriter);
    OpenZL::Buffer compressedBuffer = compressedWriter.toBuffer();
    std::cout << "Data compressed! Compressed size: " << compressedBuffer.size() << " bytes." << std::endl;

    // 6. Decompress the data
    OpenZL::MemoryWriter decompressedWriter;
    compressor->decompress(compressedBuffer, decompressedWriter);
    OpenZL::Buffer decompressedBuffer = decompressedWriter.toBuffer();
    std::cout << "Data decompressed! Decompressed size: " << decompressedBuffer.size() << " bytes." << std::endl;

    // 7. Verify the results
    OpenZL::MemoryReader reader(decompressedBuffer);
    std::vector<MyMessage> decompressedData;
    for (size_t i = 0; i < originalData.size(); ++i) {
        MyMessage msg;
        msg.tag = reader.readString();
        msg.value = reader.readInt32();
        decompressedData.push_back(msg);
    }

    bool success = true;
    for (size_t i = 0; i < originalData.size(); ++i) {
        if (originalData[i].tag != decompressedData[i].tag ||
            originalData[i].value != decompressedData[i].value) {
            std::cerr << "Verification failed at index " << i << "!" << std::endl;
            std::cerr << "Original: {" << originalData[i].tag << ", " << originalData[i].value << "}" << std::endl;
            std::cerr << "Decompressed: {" << decompressedData[i].tag << ", " << decompressedData[i].value << "}" << std::endl;
            success = false;
            break;
        }
    }

    if (success) {
        std::cout << "Verification successful! Original and decompressed data match." << std::endl;
        std::cout << "Congratulations on your first custom OpenZL compressor!" << std::endl;
    } else {
        std::cerr << "Something went wrong during compression/decompression." << std::endl;
        return 1;
    }
    return 0;
}
Explanation of the main function:
- Schema Instantiation: We call defineMyMessageSchema() to get our schema object.
- CompressorBuilder: This class takes your schema and prepares to build the actual compressor. It’s the entry point to OpenZL’s optimization engine.
- builder.build(): This is where OpenZL performs its magic! It analyzes your schema, selects the best codecs, and constructs the internal compression graph. It returns a std::unique_ptr<OpenZL::Compressor>.
- OpenZL::MemoryWriter and OpenZL::MemoryReader: These are utility classes provided by OpenZL for writing data into memory buffers and reading from them, respectively. They handle the serialization of your structured data into a byte stream that OpenZL can process. We manually write each field (writeString, writeInt32) in the order defined by our schema.
- compressor->compress(inputBuffer, compressedWriter): This is the actual compression call. It takes the input buffer (our raw data) and writes the compressed output to compressedWriter.
- compressor->decompress(compressedBuffer, decompressedWriter): The decompression call, taking the compressed data and writing the original data back to decompressedWriter.
- Verification: We read the data back from the decompressedBuffer using MemoryReader and compare it element by element with our originalData to ensure lossless compression.
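The chapter does not spell out the byte layout that the writer and reader use, so the following is only a sketch of what "serializing fields in schema order" can mean. It assumes little-endian 32-bit integers and strings stored as a 32-bit length followed by raw bytes. All helper names (appendUint32, readUint32, appendInt32, readInt32, appendString, readString) are hypothetical and independent of OpenZL.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a 32-bit value in little-endian byte order.
void appendUint32(Bytes& out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFFu));
    }
}

// Read a little-endian 32-bit value and advance the offset.
std::uint32_t readUint32(const Bytes& in, std::size_t& offset) {
    if (offset > in.size() || in.size() - offset < 4) {
        throw std::out_of_range("truncated 32-bit integer");
    }
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(in[offset + i]) << (8 * i);
    }
    offset += 4;
    return v;
}

// Signed integers reuse the unsigned helpers; memcpy copies the bit pattern.
void appendInt32(Bytes& out, std::int32_t value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    appendUint32(out, bits);
}

std::int32_t readInt32(const Bytes& in, std::size_t& offset) {
    const std::uint32_t bits = readUint32(in, offset);
    std::int32_t value = 0;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Strings are stored as a 32-bit length followed by the raw bytes.
void appendString(Bytes& out, const std::string& s) {
    appendUint32(out, static_cast<std::uint32_t>(s.size()));
    out.insert(out.end(), s.begin(), s.end());
}

std::string readString(const Bytes& in, std::size_t& offset) {
    const std::uint32_t length = readUint32(in, offset);
    if (in.size() - offset < length) {
        throw std::out_of_range("truncated string");
    }
    std::string s(in.begin() + offset, in.begin() + offset + length);
    offset += length;
    return s;
}
```

Writing "status" and then an integer with these helpers produces 4 + 6 + 4 = 14 bytes, and reading them back in the same order recovers both values. The reader has no way to discover field boundaries by itself, which is why the write order and read order must agree.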
Step 4: Compile and Run (with CMake)
To compile this, you’ll need to link against the OpenZL library. Assuming you followed Chapter 3 for OpenZL setup, here’s a basic CMakeLists.txt for your project:
# CMakeLists.txt
cmake_minimum_required(VERSION 3.14)
project(OpenZLHelloWorld CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Find OpenZL. Assuming it's installed in a standard location
# or that you've configured CMAKE_PREFIX_PATH to point to its install directory.
# This assumes OpenZL provides a find_package configuration.
find_package(OpenZL REQUIRED)
add_executable(hello_compressor hello_compressor.cpp)
# Link your executable against the OpenZL library
target_link_libraries(hello_compressor PRIVATE OpenZL::OpenZL)
Compilation Steps:
- Save the C++ code as hello_compressor.cpp and the CMake code as CMakeLists.txt in the same directory.
- Create a build directory: mkdir build && cd build
- Configure CMake: cmake ..
  - Troubleshooting Hint: If find_package(OpenZL REQUIRED) fails, ensure OpenZL is correctly installed and its installation path is known to CMake (e.g., by setting CMAKE_PREFIX_PATH before running cmake).
- Build: cmake --build .
- Run: ./hello_compressor
You should see output similar to this:
--- OpenZL Hello World Compressor ---
Schema 'MyMessageSchema' defined successfully.
Custom compressor built successfully.
Original data size (approx): 70 bytes (rough estimate for strings).
Input buffer prepared with 5 messages.
Data compressed! Compressed size: XX bytes.
Data decompressed! Decompressed size: YY bytes.
Verification successful! Original and decompressed data match.
Congratulations on your first custom OpenZL compressor!
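The compressed size XX depends on your OpenZL version and configuration. With only five short messages, fixed framing overhead can dominate, so XX may not be smaller than the input; the savings become visible with more data or repetitive patterns. Because compression is lossless, YY should equal the size of the serialized input buffer.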
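The size printed by the program is only a rough estimate, so it can help to compute the real content size and a compression ratio yourself. The sketch below is independent of OpenZL; Message, payloadBytes, and compressionRatio are hypothetical names, and the count covers only tag characters and 32-bit values, not any length prefixes or framing a serializer might add.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Same shape as the chapter's MyMessage; redefined so the sketch stands alone.
struct Message {
    std::string tag;
    std::int32_t value;
};

// Bytes of actual content: tag characters plus one 32-bit integer per message.
std::size_t payloadBytes(const std::vector<Message>& messages) {
    std::size_t total = 0;
    for (const Message& m : messages) {
        total += m.tag.size() + sizeof(std::int32_t);
    }
    return total;
}

// A ratio above 1 means the compressed output is smaller than the input.
double compressionRatio(std::size_t rawBytes, std::size_t compressedBytes) {
    if (compressedBytes == 0) return 0.0;  // avoid division by zero
    return static_cast<double>(rawBytes) / static_cast<double>(compressedBytes);
}
```

For the five sample messages, payloadBytes returns 48 (28 tag characters plus 20 bytes of integers), noticeably less than the rough estimate of originalData.size() * (sizeof(int) + 10). Comparing the compressed size against this number gives a more honest picture of what the compressor achieved.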
Mini-Challenge: Extend Your Schema
You’ve built your first compressor! Feeling good? Excellent! Now, let’s make a small tweak to solidify your understanding.
Challenge: Modify the MyMessage struct and its corresponding OpenZL schema to include an additional field: a timestamp (represented as a long long for milliseconds since epoch).
Steps:
- Update the MyMessage struct in hello_compressor.cpp.
- Modify defineMyMessageSchema() to add the new timestamp field with its appropriate OpenZL type (OpenZL::Type::Int64()).
- Adjust the originalData vector in main() to include dummy timestamps for each message.
- Update the inputWriter and reader logic in main() to write and read the new timestamp field.
- Recompile and run!
Hint: Remember to add schema.addField("timestamp", OpenZL::Type::Int64()); and ensure you write/read msg.timestamp in the correct order.
What to Observe/Learn:
- How easy it is to extend your data structure and update the schema.
- The impact on the compressed size (it will increase slightly, but OpenZL will still try to optimize it).
- The importance of keeping the C++ struct, OpenZL schema, and serialization/deserialization logic in sync.
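Once the struct has three fields, the hand-written comparison in the verification step gets longer and easier to get wrong. One possible approach, sketched here without any OpenZL calls, is to give the struct an equality operator and report the first mismatching index. TimedMessage and firstMismatch are hypothetical names; in the chapter's code the struct is still called MyMessage.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical extended message with the timestamp from the challenge.
struct TimedMessage {
    std::string tag;
    std::int32_t value;
    std::int64_t timestamp;  // milliseconds since epoch
};

// Field-by-field equality, kept next to the struct so it stays in sync.
bool operator==(const TimedMessage& a, const TimedMessage& b) {
    return a.tag == b.tag && a.value == b.value && a.timestamp == b.timestamp;
}

// Returns the index of the first differing element, or -1 when both
// sequences have the same length and contents.
std::ptrdiff_t firstMismatch(const std::vector<TimedMessage>& a,
                             const std::vector<TimedMessage>& b) {
    const std::size_t common = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (!(a[i] == b[i])) return static_cast<std::ptrdiff_t>(i);
    }
    if (a.size() != b.size()) return static_cast<std::ptrdiff_t>(common);
    return -1;
}
```

With this in place, adding another field later means touching the struct and its operator== together, rather than hunting for comparisons scattered through main().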
Common Pitfalls & Troubleshooting
Schema Mismatch during Serialization/Deserialization:
- Problem: You define tag then value in your schema, but accidentally write value then tag to the MemoryWriter, or read them in the wrong order. This leads to corrupted data or crashes.
- Solution: Always ensure the order of addField calls in your schema definition exactly matches the order you write fields to the OpenZL::Writer and read them from the OpenZL::Reader. Consistency is key!
OpenZL Library Not Found (CMake Error):
- Problem: When running cmake .., you get an error like “Could not find package OpenZL”.
- Solution: This typically means CMake doesn’t know where OpenZL is installed.
  - Ensure OpenZL was installed (e.g., make install after building it).
  - If installed to a non-standard path, tell CMake by setting CMAKE_PREFIX_PATH. For example, if OpenZL is installed in /opt/openzl, run cmake -D CMAKE_PREFIX_PATH=/opt/openzl ..
  - Verify that OpenZLConfig.cmake or openzl-config.cmake exists in CMAKE_PREFIX_PATH/lib/cmake/OpenZL or similar.
Compiler Errors (Missing Headers/Symbols):
- Problem: Your C++ code fails to compile with errors about undefined types or functions from OpenZL.
- Solution: Double-check your #include directives. You need at least <openzl/schema/schema.h>, <openzl/schema/types.h>, <openzl/compressor/compressor.h>, and <openzl/io/buffer.h> for this example. Ensure your target_link_libraries in CMakeLists.txt correctly links against OpenZL::OpenZL.
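To see why the schema-mismatch pitfall above produces garbage rather than a clean error, consider this standalone sketch. It assumes a plain little-endian layout with a 4-byte value followed by an 8-byte timestamp; appendLE, readLE, writeRecord, readInWriteOrder, and readInWrongOrder are hypothetical helpers and do not reflect OpenZL's actual wire format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append the low `width` bytes of v in little-endian order.
void appendLE(Bytes& out, std::uint64_t v, int width) {
    for (int i = 0; i < width; ++i) {
        out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFFu));
    }
}

// Read `width` little-endian bytes at offset and advance it.
// Assumes the caller guarantees offset + width <= in.size().
std::uint64_t readLE(const Bytes& in, std::size_t& offset, int width) {
    std::uint64_t v = 0;
    for (int i = 0; i < width; ++i) {
        v |= static_cast<std::uint64_t>(in[offset + i]) << (8 * i);
    }
    offset += static_cast<std::size_t>(width);
    return v;
}

struct Decoded {
    std::uint64_t value;
    std::uint64_t timestamp;
};

// Writer order: 4-byte value first, then 8-byte timestamp.
Bytes writeRecord(std::uint32_t value, std::uint64_t timestamp) {
    Bytes out;
    appendLE(out, value, 4);
    appendLE(out, timestamp, 8);
    return out;
}

// Reader that matches the writer's field order.
Decoded readInWriteOrder(const Bytes& in) {
    std::size_t offset = 0;
    Decoded d{};
    d.value = readLE(in, offset, 4);
    d.timestamp = readLE(in, offset, 8);
    return d;
}

// Reader with the fields swapped: same bytes, different interpretation.
Decoded readInWrongOrder(const Bytes& in) {
    std::size_t offset = 0;
    Decoded d{};
    d.timestamp = readLE(in, offset, 8);
    d.value = readLE(in, offset, 4);
    return d;
}
```

Writing value 200 with timestamp 1 and reading in the matching order returns 200 and 1. Reading the same 12 bytes in the swapped order returns a timestamp of 4294967496 and a value of 0. No exception is raised, because every byte is a valid integer byte, so only a round-trip check like the verification step will catch the mistake.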
Summary
Congratulations! You’ve successfully built and tested your first custom OpenZL compressor. Let’s recap what you accomplished in this chapter:
- Understood the role of the OpenZL Schema as the blueprint for your structured data.
- Learned how OpenZL Codecs are the building blocks intelligently selected by the framework.
- Defined a simple C++ data structure (MyMessage).
- Translated that structure into an OpenZL Schema using OpenZL::Schema and addField with appropriate OpenZL::Types.
- Used the CompressorBuilder to generate a specialized compressor.
- Performed hands-on compression and decompression of your custom data using OpenZL::MemoryWriter and OpenZL::MemoryReader.
- Verified the integrity of your compressed and decompressed data.
- Tackled a mini-challenge to extend your schema, reinforcing your understanding.
- Identified common pitfalls and learned how to troubleshoot them.
You’ve taken a significant step from theory to practice with OpenZL. You now have a foundational understanding of how to describe your data to OpenZL and harness its power to create custom, highly efficient compression solutions.
In the next chapter, we’ll delve deeper into more complex schema definitions, exploring nested structures, optional fields, and how to represent more intricate data formats. Get ready to expand your OpenZL toolkit!