Welcome back, aspiring data compression expert! In the previous chapters, we laid the groundwork for understanding OpenZL’s architecture and setting up our environment. Now, it’s time to dive into the heart of OpenZL: building and executing compression plans. This is where OpenZL truly shines, allowing us to leverage its format-aware capabilities for superior compression of structured data.
In this chapter, we’ll walk through the complete OpenZL workflow, from describing your data’s shape to training an optimized compression plan and then using it to compress and decompress your files. Understanding this workflow is crucial, as it’s the foundation for achieving the best possible compression ratios and speeds for your specific datasets. Get ready to put your knowledge into practice and see OpenZL in action!
The OpenZL Workflow: An Overview
OpenZL isn’t a “one-size-fits-all” compressor; it’s a framework that learns how to best compress your data. This learning process is captured in what OpenZL calls a “compression plan.” The overall process is a clear, sequential flow: describe your data’s structure → train a plan on sample data → compress → decompress.
The journey begins with your data and ends with compressed (and decompressed) data. The critical intermediate steps involve defining your data’s structure and training a plan. Let’s break down these core concepts.
Core Concepts: SDDL and Compression Plans
OpenZL’s power stems from its ability to understand the structure of your data. This understanding is achieved through two main components: SDDL for describing the data, and Compression Plans for orchestrating the compression process.
What is SDDL? (Simple Data Description Language)
Imagine you have a box of LEGOs. To build something specific, you first need to know what pieces you have and how they fit together, right? SDDL serves a similar purpose for OpenZL.
SDDL (Simple Data Description Language) is a domain-specific language that allows you to describe the precise structure of your structured data. Instead of OpenZL trying to guess your data’s format (like a generic compressor), you explicitly tell it. This “format-awareness” is what enables OpenZL to apply highly specialized and effective codecs to different parts of your data.
Why is it important?
- Targeted Compression: OpenZL can apply the most efficient codec to each field: one for an `int`, a different one for a `float`, and yet another for a `string`.
- Improved Ratios: By understanding boundaries and types, OpenZL avoids common pitfalls of generic compressors that might treat a number as a string, leading to suboptimal results.
- Lossless Guarantee: When you decompress, OpenZL knows exactly how to reconstruct the original data because it understands its structure, ensuring perfect lossless recovery.
Let’s look at a simple example. Imagine we’re tracking sensor readings, each with a timestamp, temperature, and humidity.
```
// sensor_data.sddl
struct SensorReading {
    timestamp: uint64;    // Unix timestamp in milliseconds
    temperature: float32; // Temperature in Celsius
    humidity: float32;    // Relative humidity percentage
}
```
In this SDDL snippet:

- `struct SensorReading`: We’re defining a composite data type named `SensorReading`. It’s like a blueprint for a single sensor record.
- `timestamp: uint64;`: This line declares a field named `timestamp` that is an unsigned 64-bit integer. OpenZL will know to use codecs optimized for large integers.
- `temperature: float32;`: This declares a `temperature` field as a 32-bit floating-point number. Again, specific floating-point codecs can be applied.
- `humidity: float32;`: Another 32-bit float for humidity.
SDDL supports various primitive types (`uint8`, `int16`, `float64`, `bool`), composite types (`struct`, `array`), and more advanced constructs. It’s designed to be simple yet expressive enough to describe complex data layouts. You can find comprehensive documentation on SDDL at the OpenZL official site (as of early 2026).
What are Compression Plans?
Once OpenZL understands your data’s structure via SDDL, it needs a strategy to compress it. This strategy is called a Compression Plan.
A Compression Plan in OpenZL is essentially a directed acyclic graph (DAG) of codecs. Think of it as a meticulously designed pipeline where different codecs (compression algorithms) are applied in a specific order to different parts of your data.
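The plan format and built-in codecs are internal to OpenZL, but the core idea (routing each field through its own chain of codecs) can be sketched in a few lines of plain Python. Everything below is a toy illustration with generic codecs, not OpenZL code:

```python
# Toy field-aware pipeline (illustrative only -- not OpenZL's API or codecs).
# A plan routes each part of the data through its own codec chain; here the
# timestamp column gets delta-coding before generic entropy coding.
import struct
import zlib

records = [(1700000000000 + i * 1000, 20.0 + (i % 10) * 0.5) for i in range(100)]

# Node 1: split the record stream into per-field columns.
timestamps = [t for t, _ in records]
temperatures = [v for _, v in records]

# Node 2: field-specific transform -- delta-encode the monotonic timestamps.
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

# Node 3: entropy-code each column independently.
ts_blob = zlib.compress(struct.pack(f"<{len(deltas)}q", *deltas))
temp_blob = zlib.compress(struct.pack(f"<{len(temperatures)}f", *temperatures))

raw_size = len(records) * struct.calcsize("<Qf")
print(f"{len(ts_blob) + len(temp_blob)} bytes (raw: {raw_size})")
```

Because the timestamp deltas are nearly constant, that column collapses to almost nothing, which is exactly the kind of win a field-aware graph of codecs unlocks over compressing the interleaved bytes as one opaque stream.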
How are plans created? OpenZL doesn’t just pick a random plan. It trains a plan. This involves:
- Exploration: OpenZL, using your SDDL schema and a sample of your actual data, explores various combinations of available codecs and their parameters.
- Optimization: It evaluates these combinations based on your specified optimization goals (e.g., prioritize maximum compression ratio, prioritize fastest compression speed, or find a balance).
- Selection: It selects the “best” plan that meets your criteria for your specific data.
This training process is powerful because it tailors the compression strategy to your unique dataset, often outperforming generic compressors. The output of this training is a .plan file, which is a binary file containing the optimized graph of codecs.
Why train a plan?
- Optimal Performance: A plan is specifically tuned for your data, leading to better compression ratios or speeds than general-purpose algorithms.
- Adaptability: Different datasets have different characteristics. Training allows OpenZL to adapt its strategy.
- Reproducibility: Once a plan is trained, it can be used consistently across all your data, ensuring predictable results.
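To make the explore/evaluate/select loop concrete, here is a toy “trainer” that tries a handful of generic codec settings on sample bytes and keeps the one with the best ratio. OpenZL searches a far larger space of format-aware codec graphs; this sketch shares no code with it:

```python
# Toy plan "training": try candidate codecs on sample data, keep the best.
# Purely illustrative -- OpenZL explores graphs of format-aware codecs.
import bz2
import lzma
import zlib

# Synthetic sample data standing in for real sensor readings.
sample = (b"timestamp,temperature,humidity\n" +
          b"".join(b"1700000%03d000,2%d.5,5%d.0\n" % (i, i % 10, i % 7)
                   for i in range(100)))

candidates = {
    "zlib-1": lambda d: zlib.compress(d, 1),   # fast, weaker
    "zlib-9": lambda d: zlib.compress(d, 9),   # slower, stronger
    "bz2":    lambda d: bz2.compress(d),
    "lzma":   lambda d: lzma.compress(d),
}

# Evaluate every candidate, then select the smallest output ("best ratio").
results = {name: len(codec(sample)) for name, codec in candidates.items()}
best = min(results, key=results.get)
print("best plan:", best, "ratio:", results[best] / len(sample))
```

A real training run also weighs compression and decompression speed against ratio, which is why OpenZL lets you pick an optimization target rather than always chasing the smallest output.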
Step-by-Step Implementation: Compressing Sensor Data
Let’s get hands-on and implement the OpenZL workflow to compress our SensorReading data.
Scenario: We have a stream of sensor readings that we want to store efficiently. Each reading consists of a timestamp, temperature, and humidity.
Prerequisites: You should have OpenZL installed and configured as covered in Chapter 3. We’ll be using the openzl command-line tool.
Step 1: Define the Data Schema with SDDL
First, we need to tell OpenZL about our data’s structure.
First, create a file named `sensor_data.sddl` in your working directory. Then add the following content:

```
// sensor_data.sddl
struct SensorReading {
    timestamp: uint64;    // Unix timestamp in milliseconds
    temperature: float32; // Temperature in Celsius
    humidity: float32;    // Relative humidity percentage
}

// We'll be compressing a stream of these readings
array<SensorReading> readings;
```

Explanation:

- The `struct SensorReading` defines the layout of a single sensor record, as discussed before.
- The new line `array<SensorReading> readings;` is crucial. It tells OpenZL that our input data will be an array (a sequence) of `SensorReading` structs. This is a common pattern for time-series or tabular data.
Step 2: Prepare Sample Data
OpenZL needs some real data to “learn” from during the training process. This sample data should be representative of the data you intend to compress. For our example, let’s create a small binary file containing a few SensorReading records.
Since directly writing binary data can be tricky, we’ll use a Python script to generate a sample file that strictly adheres to our SDDL schema.
Create a file named `generate_sample_data.py` in the same directory, and add the following Python code:

```python
# generate_sample_data.py
import struct
import time

def generate_sensor_data(filename="sample_sensor_data.bin", num_records=100):
    """Generates a binary file with sample SensorReading data."""
    with open(filename, "wb") as f:
        for i in range(num_records):
            timestamp = int(time.time() * 1000) + i * 1000  # Milliseconds, increasing by 1 second
            temperature = 20.0 + (i % 10) * 0.5             # 20.0, 20.5, 21.0, ...
            humidity = 50.0 + (i % 7) * 1.5                 # 50.0, 51.5, 53.0, ...
            # Pack data according to SDDL: uint64, float32, float32
            # 'Q' for unsigned long long (uint64), 'f' for float (float32)
            packed_data = struct.pack('<Qff', timestamp, temperature, humidity)
            f.write(packed_data)
    print(f"Generated {num_records} sensor records to {filename}")

if __name__ == "__main__":
    generate_sensor_data()
```

Explanation:

- We use the `struct` module to pack our Python numbers into binary formats that match our SDDL types (`<Qff` means little-endian, unsigned long long, float, float).
- The script generates `num_records` (default 100) of plausible sensor data, with `timestamp` increasing and `temperature`/`humidity` showing some variation.
Run the Python script from your terminal:

```shell
python generate_sample_data.py
```

This will create a file named `sample_sensor_data.bin` containing 100 sensor records. This is our sample data for training.
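Before training, it’s worth sanity-checking that the generated file really matches the schema. Here is a small reader that reverses the packing above; the demo runs on a throwaway file it creates itself, but you can point `read_readings` at `sample_sensor_data.bin`:

```python
# Read SensorReading records back and sanity-check the layout.
import os
import struct
import tempfile

RECORD = struct.Struct("<Qff")  # must match the SDDL field order and types

def read_readings(path):
    """Return (timestamp, temperature, humidity) tuples from a binary file."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % RECORD.size:
        raise ValueError("file size is not a whole number of records")
    return [RECORD.unpack_from(data, off) for off in range(0, len(data), RECORD.size)]

# Demo on a throwaway file (use "sample_sensor_data.bin" for the real check):
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.bin")
    with open(path, "wb") as f:
        f.write(RECORD.pack(1700000000000, 20.0, 50.0))
        f.write(RECORD.pack(1700000001000, 20.5, 51.5))
    demo = read_readings(path)
print(demo)
```

If the values you read back look wrong (e.g., absurd timestamps), the byte order or field order in the generator disagrees with the schema.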
Step 3: Train a Compression Plan
Now that we have our data schema (SDDL) and sample data, we can train OpenZL to create an optimal compression plan.
Run the `openzl train` command:

```shell
openzl train \
  --sddl sensor_data.sddl \
  --input sample_sensor_data.bin \
  --output sensor_data.plan \
  --target-metric ratio \
  --max-time 60s
```

Explanation of arguments:

- `openzl train`: The command to initiate plan training.
- `--sddl sensor_data.sddl`: Specifies our SDDL schema file. OpenZL uses this to understand the data structure.
- `--input sample_sensor_data.bin`: Provides the sample data for OpenZL to analyze and optimize against.
- `--output sensor_data.plan`: The path where the generated compression plan will be saved.
- `--target-metric ratio`: Tells OpenZL to prioritize the best compression ratio. You could also use `speed` for faster compression or `balanced` for a compromise.
- `--max-time 60s`: Limits the training process to a maximum of 60 seconds. Training can be computationally intensive, so setting a time limit is good practice. For real-world scenarios, you might allow it to run longer for better optimization.
Observe the output: OpenZL will print progress messages as it explores different codec combinations and evaluates them. After some time (up to 60 seconds), it will report the best plan found and save it to `sensor_data.plan`. You should see output similar to (details will vary):

```
INFO: Training started...
INFO: Exploring codec combinations...
INFO: Best plan found (ratio: 0.15, speed: 1234 MB/s)
INFO: Plan saved to sensor_data.plan
```

Congratulations! You’ve just trained your first OpenZL compression plan.
Step 4: Compress Data using the Plan
With our sensor_data.plan in hand, we can now compress actual data. Let’s compress our sample_sensor_data.bin file using the plan we just created.
Run the `openzl compress` command:

```shell
openzl compress \
  --input sample_sensor_data.bin \
  --plan sensor_data.plan \
  --output compressed_sensor_data.zl
```

Explanation of arguments:

- `openzl compress`: The command to perform compression.
- `--input sample_sensor_data.bin`: The data file we want to compress.
- `--plan sensor_data.plan`: The pre-trained compression plan to use.
- `--output compressed_sensor_data.zl`: The path where the compressed output will be saved. The `.zl` extension is a common convention for OpenZL compressed files.
OpenZL will quickly process the file and save the compressed version. Compare the size of `sample_sensor_data.bin` with `compressed_sensor_data.zl`. You should see a significant reduction!
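To quantify the reduction, a few lines of Python can report the compressed size as a fraction of the original. The demo below uses throwaway files with made-up sizes so it runs anywhere; point the helper at your real `sample_sensor_data.bin` and `compressed_sensor_data.zl`:

```python
# Measure the size reduction achieved by compression (lower ratio is better).
import os
import tempfile

def compression_ratio(original_path, compressed_path):
    """Compressed size as a fraction of the original size."""
    return os.path.getsize(compressed_path) / os.path.getsize(original_path)

# Demo with throwaway files standing in for the real ones:
with tempfile.TemporaryDirectory() as d:
    orig = os.path.join(d, "sample_sensor_data.bin")
    comp = os.path.join(d, "compressed_sensor_data.zl")
    with open(orig, "wb") as f:
        f.write(b"\x00" * 1600)  # 100 records x 16 bytes each
    with open(comp, "wb") as f:
        f.write(b"\x00" * 240)   # pretend compressed output
    ratio = compression_ratio(orig, comp)
print(f"compressed to {ratio:.0%} of original")  # prints "compressed to 15% of original"
```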
Step 5: Decompress Data
The final step in our workflow is to decompress the data back to its original form, verifying that our lossless compression worked perfectly.
Run the `openzl decompress` command:

```shell
openzl decompress \
  --input compressed_sensor_data.zl \
  --plan sensor_data.plan \
  --output decompressed_sensor_data.bin
```

Explanation of arguments:

- `openzl decompress`: The command to perform decompression.
- `--input compressed_sensor_data.zl`: The compressed file.
- `--plan sensor_data.plan`: The same plan used for compression is required for decompression. This is how OpenZL knows how to reverse the process.
- `--output decompressed_sensor_data.bin`: The path where the decompressed output will be saved.
Verify the data: To confirm lossless compression, you can compare the original `sample_sensor_data.bin` with the `decompressed_sensor_data.bin` file. They should be byte-for-byte identical. On Linux/macOS, you can use `diff`:

```shell
diff sample_sensor_data.bin decompressed_sensor_data.bin
```

If `diff` returns no output, the files are identical! Mission accomplished.
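On Windows, or from within a script, Python’s standard-library `filecmp` does the same byte-for-byte check. The demo compares two throwaway files it creates itself; substitute `sample_sensor_data.bin` and `decompressed_sensor_data.bin` for the real verification:

```python
# Byte-for-byte comparison in Python (portable alternative to `diff`).
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "original.bin")
    b = os.path.join(d, "roundtrip.bin")
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(bytes(range(256)))  # identical contents in both files
    # shallow=False forces a content comparison, not just os.stat() metadata
    identical = filecmp.cmp(a, b, shallow=False)
print("lossless!" if identical else "mismatch")  # prints "lossless!"
```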
Mini-Challenge: Expanding Your Schema
You’ve successfully compressed and decompressed data using OpenZL’s core workflow. Now, let’s test your understanding with a small modification.
Challenge:
Imagine our sensor also started reporting pressure as a 32-bit floating-point number.
1. Modify `sensor_data.sddl` to include a `pressure: float32;` field within the `SensorReading` struct.
2. Modify `generate_sample_data.py` to include a `pressure` value (e.g., `1013.25 + (i % 5) * 0.1`) and pack it correctly into the binary data (remember to add another `'f'` to the `struct.pack` format string).
3. Re-generate `sample_sensor_data.bin` using the updated Python script.
4. Re-train a new compression plan (you can overwrite `sensor_data.plan` or save it as `sensor_data_v2.plan`).
5. Re-compress and decompress the new `sample_sensor_data.bin` using your new plan.
6. Verify the decompressed data.
Hint: Pay close attention to the order of fields in your SDDL and how you pack them in Python. They must match exactly!
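One concrete checkpoint for the challenge: adding `pressure` grows each record from 16 to 20 bytes, and the pack call needs one more `'f'`. You can verify both before re-training:

```python
# Sanity-check the record layout before and after adding the pressure field.
import struct

old_record = struct.Struct("<Qff")   # timestamp, temperature, humidity
new_record = struct.Struct("<Qfff")  # ...plus pressure: float32

print(old_record.size, "->", new_record.size)  # prints "16 -> 20"

# The updated pack call inside generate_sample_data.py would look like:
i = 0  # loop variable from the generator script
packed = new_record.pack(1700000000000, 20.0, 50.0, 1013.25 + (i % 5) * 0.1)
print(len(packed))  # prints "20"
```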
What to Observe/Learn: This exercise reinforces the direct relationship between your data’s structure, its SDDL definition, and the training process. You’ll see how OpenZL adapts its plan when the underlying data schema changes, highlighting its flexibility.
Common Pitfalls & Troubleshooting
Even with a clear workflow, you might encounter issues. Here are a few common pitfalls:
SDDL-Data Mismatch:

- Problem: Your `sample_sensor_data.bin` does not actually conform to the structure defined in `sensor_data.sddl`. For example, you defined `uint64` but packed a `uint32`, or you forgot a field in your Python script.
- Symptom: `openzl train` or `openzl compress` might fail with “data parsing error,” “schema mismatch,” or “unexpected end of input.”
- Solution: Double-check your SDDL file and your data generation script (e.g., `generate_sample_data.py`). Ensure every field’s type and order in the binary data exactly matches the SDDL. Use `struct` format codes carefully.
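A quick size-level check catches many of these mismatches before training: the file length must be a whole multiple of the record size. (A clean result doesn’t prove the fields themselves are right, but a nonzero remainder proves something is wrong.) The demo fabricates a misaligned file, assuming the 16-byte `SensorReading` layout from this chapter:

```python
# Detect a common SDDL/data mismatch: file length not divisible by record size.
import os
import struct
import tempfile

RECORD_SIZE = struct.calcsize("<Qff")  # 16 bytes per SensorReading

def check_alignment(path):
    """Return (num_records, leftover_bytes) for a binary record file."""
    return divmod(os.path.getsize(path), RECORD_SIZE)

# Demo: a file with 3 full records plus 5 stray bytes is misaligned.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "bad.bin")
    with open(path, "wb") as f:
        f.write(b"\x00" * (3 * RECORD_SIZE + 5))
    records, leftover = check_alignment(path)
print(records, leftover)  # prints "3 5"
```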
Insufficient or Non-Representative Sample Data:

- Problem: Your `sample_sensor_data.bin` is too small, or it doesn’t contain the full range of values/patterns present in your real data.
- Symptom: The trained plan might yield poor compression ratios on your actual, larger dataset, or it might be slower than expected.
- Solution: Provide a larger, diverse sample of your real data for training. If your data has distinct phases or outliers, try to include examples of these in your training set.
Training Time vs. Quality:

- Problem: You set `--max-time` too low, and OpenZL couldn’t find an optimal plan. Or you let it run too long for minimal gain.
- Symptom: Suboptimal compression ratio/speed, or training takes an excessively long time.
- Solution: Experiment with `--max-time`. For critical applications, allow more time for training on a powerful machine. For less critical data, a shorter training might be acceptable. Monitor the output of `openzl train` to see if the “best plan found” metrics are still improving significantly towards the end of the allocated time.
Summary
In this chapter, you’ve gained a deep understanding of the core OpenZL workflow and put it into practice:
- SDDL (Simple Data Description Language): You learned how to precisely describe the structure of your data, enabling OpenZL’s format-aware compression.
- Compression Plans: You understood that these are optimized DAGs of codecs, tailored to your specific data through a training process.
- Hands-on Workflow: You successfully defined an SDDL schema, generated sample data, trained a compression plan, and then used it to compress and decompress binary sensor data.
- Troubleshooting: You’re now aware of common issues like SDDL-data mismatches and how to address them.
You’ve built a solid foundation for using OpenZL effectively. In the next chapter, we’ll explore more advanced SDDL features and delve deeper into how OpenZL integrates with existing systems, opening up even more possibilities for efficient data handling!
References
- OpenZL Official GitHub Repository
- OpenZL Concepts Documentation
- OpenZL SDDL Introduction
- Python `struct` Module Documentation