Welcome to Chapter 15! This is where we bring everything we’ve learned about OpenZL together into an exciting, hands-on project. In the real world, data is often structured, and one of the most common forms is time-series data, particularly from sensors. Think about temperature readings, IoT device metrics, or stock prices – they all have a timestamp and one or more associated values.
In this chapter, you’ll learn how to leverage OpenZL’s unique format-aware compression capabilities to efficiently compress time-series sensor data. We’ll simulate a stream of sensor readings, define a schema that OpenZL can understand, and then use OpenZL to achieve impressive compression ratios. This project will solidify your understanding of OpenZL’s core concepts, from schema definition to the power of specialized codec graphs. Get ready to build something practical and see OpenZL in action!
Before we dive in, make sure you’re comfortable with the basics of OpenZL setup, schema definition, and the concept of codec graphs from previous chapters. We’ll be building on that foundation.
Core Concepts for Time-Series Compression
Time-series data presents both challenges and opportunities for compression. Let’s explore why OpenZL is particularly well-suited for this type of data.
What Makes Time-Series Data Special?
Imagine a sensor reporting temperature and humidity every second. What do you notice about this data?
- Temporal Order: Data points are inherently ordered by time.
- Regularity: Often, data arrives at fixed intervals (e.g., every second, minute).
- Redundancy: Consecutive readings are often very similar or follow predictable patterns (e.g., temperature doesn’t usually jump from 20°C to 50°C in a second).
- Structure: Each data point typically has a timestamp and one or more numerical values (e.g., `{"timestamp": ..., "temperature": ..., "humidity": ...}`).
Generic compressors like Gzip or Zstd are good at finding general byte patterns, but they don’t understand the semantic structure of your data. They don’t know that “temperature” is a floating-point number that changes slowly, or that “timestamp” is an ever-increasing integer. This is where OpenZL shines!
OpenZL’s Format-Aware Advantage
OpenZL, as we’ve learned, is a format-aware compression framework. Instead of treating your data as a flat stream of bytes, it takes a description of your data’s structure (a schema) and builds a specialized compressor tailored to that exact format.
For time-series data, this means:
- Schema-Driven Optimization: You define the types of your fields (e.g., `int64` for the timestamp, `float32` for temperature). OpenZL uses this information to apply codecs best suited for each data type.
- Codec Graph Specialization: OpenZL can choose specific codecs for timestamps (e.g., delta encoding for increasing values), and different codecs for sensor readings (e.g., run-length encoding for stable periods, or more advanced prediction-based codecs for gradual changes).
- Exploiting Redundancy: By understanding the data types and their typical behavior in time-series (e.g., small changes between consecutive values), OpenZL can apply techniques like differential encoding or specialized floating-point compression algorithms that generic compressors can’t.
Let’s visualize how OpenZL might process a simple time-series record.
Figure 15.1: A simplified OpenZL Codec Graph for a single time-series record. Each field is processed by a specialized codec before being combined.
In this diagram, the “Parse Schema” step allows OpenZL to identify individual fields. Then, based on the schema and potentially some training data, it selects specific codecs. For instance, timestamps often benefit from delta encoding because successive timestamps usually differ by a small, consistent amount. Temperature and humidity, being floating-point numbers, might use specialized floating-point compression techniques.
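Delta encoding is easy to demonstrate outside of OpenZL. The following standalone sketch shows the underlying idea, not the OpenZL API: it turns a stream of millisecond timestamps into deltas and back.

```cpp
#include <cstdint>
#include <vector>

// Delta-encode: keep the first value as-is, then store only the
// difference from the previous value. Regular 1-second sampling turns
// large 64-bit timestamps into a stream of identical small deltas.
std::vector<int64_t> delta_encode(const std::vector<int64_t>& values) {
    std::vector<int64_t> deltas;
    deltas.reserve(values.size());
    int64_t prev = 0;
    for (int64_t v : values) {
        deltas.push_back(v - prev);
        prev = v;
    }
    return deltas;
}

// Invert the transform by accumulating the deltas.
std::vector<int64_t> delta_decode(const std::vector<int64_t>& deltas) {
    std::vector<int64_t> values;
    values.reserve(deltas.size());
    int64_t acc = 0;
    for (int64_t d : deltas) {
        acc += d;
        values.push_back(acc);
    }
    return values;
}
```

For data sampled every 1000 ms, every delta after the first is exactly 1000, so the per-record information content collapses before any entropy coding even runs.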
Step-by-Step Implementation: Compressing Sensor Data
For this project, we’ll simulate a stream of sensor data, define its structure, and then compress and decompress it using OpenZL. We’ll assume you have OpenZL installed and set up according to Chapter 2. We’ll use a high-level conceptual API representation, as OpenZL primarily offers C++ APIs, and the goal is to understand the process rather than specific language bindings.
Prerequisites:
- OpenZL library installed (refer to Chapter 2 for setup).
- A C++17 compatible compiler (e.g., GCC 9+, Clang 9+).
- CMake (version 3.15+).
We’ll work with a simplified C++ example to illustrate the concepts.
Step 1: Generating Sample Time-Series Data
First, let’s create some synthetic time-series data. Imagine a sensor reporting temperature and humidity.
// sensor_data_generator.h
#pragma once
#include <vector>
#include <string>
#include <chrono>
#include <random>
struct SensorReading {
int64_t timestamp_ms; // Milliseconds since epoch
float temperature_c; // Celsius
float humidity_percent; // Percentage
};
std::vector<SensorReading> generate_sensor_data(size_t num_readings) {
std::vector<SensorReading> data;
data.reserve(num_readings);
auto now = std::chrono::system_clock::now();
int64_t current_timestamp = std::chrono::duration_cast<std::chrono::milliseconds>(
now.time_since_epoch()
).count();
std::mt19937 gen(0); // For reproducibility
std::normal_distribution<float> temp_dist(25.0f, 1.0f); // Avg 25C, Std Dev 1C
std::normal_distribution<float> humid_dist(60.0f, 2.0f); // Avg 60%, Std Dev 2%
for (size_t i = 0; i < num_readings; ++i) {
SensorReading reading;
reading.timestamp_ms = current_timestamp + (i * 1000); // 1-second interval
reading.temperature_c = temp_dist(gen);
reading.humidity_percent = humid_dist(gen);
data.push_back(reading);
}
return data;
}
Explanation:
- We define a `SensorReading` struct to hold our data points: `timestamp_ms` (an `int64_t` for milliseconds since the epoch), plus `temperature_c` and `humidity_percent` (both `float`).
- The `generate_sensor_data` function creates a vector of these structs.
- It simulates readings taken every second, with temperature and humidity fluctuating slightly around averages using normal distributions. This kind of data is perfect for OpenZL!
Step 2: Defining the OpenZL Schema
Now, let’s tell OpenZL about the structure of our SensorReading. This is the crucial step where OpenZL gains its “format awareness.” OpenZL uses a schema definition language, often represented in a structured format like JSON or a programmatic builder. For our example, we’ll use a conceptual C++ builder pattern.
// openzl_schema_definition.h
#pragma once
#include <iostream>
#include <string>
#include <vector>
// Conceptual representation of OpenZL schema components
namespace OpenZL {
enum class DataType {
INT64,
FLOAT32,
// ... other types
};
struct Field {
std::string name;
DataType type;
// Potentially other metadata like 'is_monotonic', 'delta_encoding_candidate'
};
struct Schema {
std::string name;
std::vector<Field> fields;
// Conceptual method to register the schema with OpenZL
void register_with_openzl() const {
// In a real OpenZL API, this would involve calling a library function
// to pass this schema definition to the framework.
// E.g., OpenZL::SchemaRegistry::add(this->name, *this);
// For now, consider this a conceptual step.
std::cout << "--- Registering OpenZL Schema: " << name << " ---" << std::endl;
for (const auto& field : fields) {
std::cout << " Field: " << field.name << ", Type: ";
switch (field.type) {
case DataType::INT64: std::cout << "INT64"; break;
case DataType::FLOAT32: std::cout << "FLOAT32"; break;
}
std::cout << std::endl;
}
std::cout << "Schema registered conceptually." << std::endl;
}
};
Schema create_sensor_schema() {
Schema sensor_schema;
sensor_schema.name = "SensorReadingSchema";
sensor_schema.fields.push_back({"timestamp_ms", DataType::INT64});
sensor_schema.fields.push_back({"temperature_c", DataType::FLOAT32});
sensor_schema.fields.push_back({"humidity_percent", DataType::FLOAT32});
return sensor_schema;
}
} // namespace OpenZL
Explanation:
- We define conceptual `DataType`, `Field`, and `Schema` structs to represent OpenZL’s schema definition.
- The `create_sensor_schema` function builds our specific schema, mapping the field names and types from our `SensorReading` struct.
- The `register_with_openzl` method is a placeholder for how you’d interact with the actual OpenZL library to make your schema known. This step is critical because it allows OpenZL to understand the logical structure of your data.
Step 3: Compressing the Data
With the schema defined, OpenZL can now build a specialized compression plan (a codec graph) and compress our data.
// main.cpp
#include "sensor_data_generator.h"
#include "openzl_schema_definition.h"
#include <iostream>
#include <vector>
// --- Conceptual OpenZL Compression/Decompression APIs ---
// In a real scenario, these would be actual library calls.
namespace OpenZL {
// Represents a compiled compression plan (codec graph)
struct Compressor {
std::string schema_name;
// Internal state for the actual compression logic
Compressor(const std::string& name) : schema_name(name) {
std::cout << "OpenZL: Initializing compressor for schema: " << schema_name << std::endl;
// In reality, this would involve OpenZL compiling a codec graph
// based on the registered schema and potentially training data.
}
// Conceptual compression function
std::vector<char> compress(const std::vector<SensorReading>& data) {
std::cout << "OpenZL: Compressing " << data.size() << " sensor readings..." << std::endl;
// Simulate compression logic. For structured data, OpenZL would apply
// field-specific codecs.
// For demonstration, let's just create a dummy compressed buffer.
size_t original_size = data.size() * sizeof(SensorReading);
size_t compressed_size = original_size / 4; // Simulate 4x compression
std::vector<char> compressed_buffer(compressed_size, 'Z'); // Fill with dummy data
std::cout << " Original size: " << original_size << " bytes" << std::endl;
std::cout << " Compressed size: " << compressed_buffer.size() << " bytes" << std::endl;
return compressed_buffer;
}
};
// Represents a compiled decompression plan
struct Decompressor {
std::string schema_name;
Decompressor(const std::string& name) : schema_name(name) {
std::cout << "OpenZL: Initializing decompressor for schema: " << schema_name << std::endl;
}
// Conceptual decompression function
std::vector<SensorReading> decompress(const std::vector<char>& compressed_data, size_t original_num_records) {
std::cout << "OpenZL: Decompressing " << compressed_data.size() << " bytes..." << std::endl;
// Simulate decompression logic.
std::vector<SensorReading> decompressed_data;
decompressed_data.reserve(original_num_records);
// For this conceptual example, we'll "re-generate" data similar to original
// This is NOT how real decompression works, but illustrates the outcome.
// In reality, OpenZL would reconstruct the original SensorReading objects.
auto now = std::chrono::system_clock::now();
int64_t current_timestamp = std::chrono::duration_cast<std::chrono::milliseconds>(
now.time_since_epoch()
).count();
std::mt19937 gen(0); // Same seed as generator for "matching" output
std::normal_distribution<float> temp_dist(25.0f, 1.0f);
std::normal_distribution<float> humid_dist(60.0f, 2.0f);
for (size_t i = 0; i < original_num_records; ++i) {
SensorReading reading;
reading.timestamp_ms = current_timestamp + (i * 1000);
reading.temperature_c = temp_dist(gen);
reading.humidity_percent = humid_dist(gen);
decompressed_data.push_back(reading);
}
std::cout << " Decompressed " << decompressed_data.size() << " sensor readings." << std::endl;
return decompressed_data;
}
};
} // namespace OpenZL
int main() {
// 1. Generate sample data
std::cout << "--- Generating Sample Sensor Data ---" << std::endl;
size_t num_readings = 1000;
std::vector<SensorReading> original_data = generate_sensor_data(num_readings);
std::cout << "Generated " << original_data.size() << " sensor readings." << std::endl;
std::cout << "First reading: Timestamp=" << original_data[0].timestamp_ms
<< ", Temp=" << original_data[0].temperature_c
<< ", Humidity=" << original_data[0].humidity_percent << std::endl;
std::cout << "Last reading: Timestamp=" << original_data.back().timestamp_ms
<< ", Temp=" << original_data.back().temperature_c
<< ", Humidity=" << original_data.back().humidity_percent << std::endl;
std::cout << std::endl;
// 2. Define and register OpenZL Schema
OpenZL::Schema sensor_schema = OpenZL::create_sensor_schema();
sensor_schema.register_with_openzl(); // Conceptually register
std::cout << std::endl;
// 3. Initialize OpenZL Compressor
OpenZL::Compressor compressor(sensor_schema.name);
std::cout << std::endl;
// 4. Compress the data
std::vector<char> compressed_buffer = compressor.compress(original_data);
std::cout << std::endl;
// 5. Initialize OpenZL Decompressor
OpenZL::Decompressor decompressor(sensor_schema.name);
std::cout << std::endl;
// 6. Decompress the data
std::vector<SensorReading> decompressed_data = decompressor.decompress(compressed_buffer, num_readings);
std::cout << std::endl;
// 7. Verification (conceptual)
// In a real scenario, you would compare original_data and decompressed_data
// element by element for equality.
std::cout << "--- Verifying Decompression (Conceptually) ---" << std::endl;
if (decompressed_data.size() == original_data.size()) {
std::cout << "Decompression successful! Number of records match." << std::endl;
// Real comparison would go here. For floats, use a small epsilon for comparison.
// Example: if (std::abs(original_data[0].temperature_c - decompressed_data[0].temperature_c) < 0.001f) { ... }
std::cout << "First decompressed reading: Timestamp=" << decompressed_data[0].timestamp_ms
<< ", Temp=" << decompressed_data[0].temperature_c
<< ", Humidity=" << decompressed_data[0].humidity_percent << std::endl;
} else {
std::cout << "Decompression failed: Record count mismatch!" << std::endl;
}
return 0;
}
Explanation of main.cpp:
- We include our `sensor_data_generator.h` and `openzl_schema_definition.h`.
- Conceptual `OpenZL::Compressor` and `OpenZL::Decompressor` structs are defined to mimic the actual OpenZL API calls.
  - The `compress` method simulates a 4x compression ratio, which is reasonable for well-structured time-series data with OpenZL.
  - The `decompress` method conceptually reconstructs the data. In a real OpenZL implementation, it would use the reverse operations of the codecs chosen during compression to perfectly restore the original data.
- In `main()`:
  - We generate 1000 sample `SensorReading` objects.
  - We create our `sensor_schema` and conceptually “register” it with OpenZL. This is the moment OpenZL understands our data’s structure.
  - An `OpenZL::Compressor` is initialized, which internally would build the specialized codec graph based on `SensorReadingSchema`.
  - The `original_data` is passed to the compressor, yielding a `compressed_buffer`.
  - An `OpenZL::Decompressor` is initialized.
  - The `compressed_buffer` is passed to the decompressor, which reconstructs the `decompressed_data`.
  - Finally, we perform a conceptual verification, checking the size and showing the first record. In a real test, you’d compare every field of every record to ensure perfect lossless decompression.
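To turn the conceptual verification in step 7 into a real check, a field-by-field comparison helper might look like the sketch below. The `SensorReading` struct is repeated so the snippet is self-contained, and the `0.001f` epsilon mirrors the comment in `main.cpp`.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct SensorReading {
    int64_t timestamp_ms;
    float temperature_c;
    float humidity_percent;
};

// Lossless compression must round-trip the data: timestamps are compared
// exactly, while the float fields use a small epsilon as suggested in
// the verification comment in main().
bool readings_match(const std::vector<SensorReading>& a,
                    const std::vector<SensorReading>& b,
                    float eps = 0.001f) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i].timestamp_ms != b[i].timestamp_ms) return false;
        if (std::fabs(a[i].temperature_c - b[i].temperature_c) > eps) return false;
        if (std::fabs(a[i].humidity_percent - b[i].humidity_percent) > eps) return false;
    }
    return true;
}
```

In `main()`, the record-count check could then be replaced by `readings_match(original_data, decompressed_data)`.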
To compile and run this (assuming you save the files as sensor_data_generator.h, openzl_schema_definition.h, and main.cpp in the same directory):
# Compile
g++ -std=c++17 main.cpp -o sensor_compressor
# Run
./sensor_compressor
You’ll see output demonstrating the data generation, schema registration, and the conceptual compression/decompression process, including the simulated size reduction.
Step 4: Observing the Codec Graph (Conceptual)
While our C++ example simulates the API, it’s important to understand what OpenZL is actually doing under the hood. When you initialize an OpenZL::Compressor with a registered schema, OpenZL internally performs a crucial step: it analyzes the schema and potentially some initial data (if provided for training) to construct an optimal codec graph.
For our SensorReadingSchema, OpenZL might choose a graph that looks something like this:
Figure 15.2: A conceptual OpenZL Codec Graph for time-series compression.
Explanation of the graph:
- Input Data Stream: The raw `SensorReading` objects.
- Field Extraction: OpenZL “de-interleaves” the structured records into separate streams for each field based on the schema.
- Specialized Codecs:
  - `timestamp_ms` (INT64): Often benefits from Delta Encoding (storing differences between consecutive timestamps, which are usually small and positive) followed by VarInt Encoding (variable-length integer encoding, efficient for small numbers).
  - `temperature_c` and `humidity_percent` (FLOAT32): For floating-point time-series data, advanced codecs like Gorilla Encoding (from Facebook’s Gorilla paper) or Chimp Encoding are highly effective. These exploit patterns in floating-point representations and small changes between values.
- Output Stream: The outputs of these specialized codecs are then concatenated into a single, highly compressed byte stream.
This is the power of OpenZL: it automatically composes these specialized algorithms based on your data’s schema, leading to much better compression than generic methods.
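To see why delta output compresses so well, here is a standalone LEB128-style VarInt sketch. This illustrates the general technique, not OpenZL’s actual codec: each byte carries 7 payload bits, with the high bit set while more bytes follow.

```cpp
#include <cstdint>
#include <vector>

// Variable-length encoding: small numbers use few bytes. A 1000 ms
// timestamp delta fits in two bytes instead of the eight a raw int64
// occupies.
std::vector<uint8_t> varint_encode(uint64_t value) {
    std::vector<uint8_t> out;
    do {
        uint8_t byte = value & 0x7F;   // low 7 bits
        value >>= 7;
        if (value != 0) byte |= 0x80;  // continuation flag
        out.push_back(byte);
    } while (value != 0);
    return out;
}

uint64_t varint_decode(const std::vector<uint8_t>& bytes) {
    uint64_t value = 0;
    int shift = 0;
    for (uint8_t b : bytes) {
        value |= static_cast<uint64_t>(b & 0x7F) << shift;
        shift += 7;
        if (!(b & 0x80)) break;        // last byte reached
    }
    return value;
}
```

Combined with delta encoding, this alone shrinks the timestamp column roughly 4x (two bytes per record instead of eight) before any entropy coding is applied.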
Mini-Challenge: Explore Schema Impact
You’ve seen how a basic schema works. Now, let’s play with it a bit.
Challenge: Modify the `create_sensor_schema()` function in `openzl_schema_definition.h`.

- Change `humidity_percent` from `FLOAT32` to `INT64` (conceptually, imagine your sensor reports integer humidity).
- Rework the `generate_sensor_data` function (in `sensor_data_generator.h`) to ensure `humidity_percent` actually generates integer values, or values that can be safely cast to integers for this experiment.
- Compile and run `main.cpp` again.
Hint: When you change `humidity_percent` to `INT64` in the schema, you’re telling OpenZL to prepare different codecs for it. How might this affect the conceptual compression ratio?
What to observe/learn:
- Even though our `compress` function is simulated, conceptually, changing a field’s type in the schema would cause OpenZL to select different, potentially more efficient, codecs. An `INT64` field might use a delta+VarInt scheme, while a `FLOAT32` might use Gorilla.
- This exercise emphasizes that the schema is not just documentation; it’s an active ingredient that guides OpenZL’s compression engine.
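The core trick behind Gorilla-style float codecs can also be shown in a few lines. This standalone sketch is illustrative only (not OpenZL’s implementation): it XORs the IEEE-754 bit patterns of consecutive readings to expose their redundancy.

```cpp
#include <cstdint>
#include <cstring>

// XOR consecutive float bit patterns. When readings change slowly, the
// sign, exponent, and high mantissa bits match, so the XOR result has
// many leading zero bits -- the compressible signal that Gorilla-style
// codecs exploit.
uint32_t float_xor(float a, float b) {
    uint32_t ba, bb;
    std::memcpy(&ba, &a, sizeof(ba));  // type-pun safely via memcpy
    std::memcpy(&bb, &b, sizeof(bb));
    return ba ^ bb;
}

// Count leading zero bits of the XOR: more zeros means fewer meaningful
// bits need to be stored for this value.
int leading_zero_bits(uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    while (!(x & 0x80000000u)) { x <<= 1; ++n; }
    return n;
}
```

For two consecutive temperatures like 25.0 and 25.1, the XOR has 16 leading zero bits, so a Gorilla-style codec can store just the short run of meaningful bits plus a small header; identical repeated values XOR to zero and cost almost nothing.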
Common Pitfalls & Troubleshooting
Working with a powerful framework like OpenZL can have its quirks. Here are a few common pitfalls when compressing structured data, especially time-series:
Schema Mismatch:
- Pitfall: The data you’re feeding to OpenZL doesn’t match the schema you defined. For example, your schema expects a `FLOAT32` but your data provides an `INT64` in that field, or fields are missing/misordered.
- Troubleshooting: OpenZL’s API will typically throw an error or return an invalid state if the data doesn’t conform to the schema. Carefully review your data generation/parsing logic and compare it field-by-field with your `OpenZL::Schema` definition. Use debugging tools to inspect the raw data before compression.
Suboptimal Schema Definition:
- Pitfall: You’ve correctly defined the types, but haven’t provided enough metadata to help OpenZL optimize further. For instance, if a timestamp field is always monotonically increasing, explicitly telling OpenZL this could enable more aggressive delta encoding.
- Troubleshooting: For advanced use cases, consult the official OpenZL documentation (once available in detail) for schema extensions or flags that hint at data characteristics (e.g., `is_monotonic`, `value_range`). Experiment with these if compression isn’t meeting expectations.
Performance with Unstructured Data:
- Pitfall: Trying to compress highly irregular or truly unstructured data (e.g., arbitrary log messages without a consistent format) with OpenZL. While OpenZL can handle fields of generic bytes, its strength is in structured data.
- Troubleshooting: If your data has highly variable structure, OpenZL might not offer significant advantages over generic compressors. Consider pre-processing such data to extract structured components or use OpenZL only on the structured parts, while using a general-purpose compressor for the unstructured blobs. OpenZL performs best on data where a clear schema can be defined.
Summary
Congratulations! You’ve successfully completed a conceptual project to compress time-series sensor data using OpenZL. Here are the key takeaways:
- Time-Series Data Characteristics: You understand why time-series data, with its temporal order and patterns, is an ideal candidate for specialized compression.
- Schema is Key: Defining an accurate schema (`OpenZL::Schema`) is paramount. It tells OpenZL the logical structure of your data, enabling format-aware compression.
- Specialized Codec Graphs: OpenZL builds a custom codec graph by selecting and composing optimal codecs (like Delta, VarInt, Gorilla/Chimp encoding) for each field based on its type and characteristics.
- Practical Application: You’ve seen how to conceptually generate data, define a schema, and then use OpenZL’s compression and decompression mechanisms.
- Impact of Schema: Even small changes in your schema (like data types) can conceptually influence the underlying codec choices and thus the compression efficiency.
This project should give you a strong foundation for approaching real-world structured data compression problems with OpenZL. In the next chapter, we’ll explore integrating OpenZL into existing data pipelines and discuss more advanced optimization techniques.
References
- Meta Engineering Blog: Introducing OpenZL: An Open Source Format-Aware Compression Framework
- OpenZL GitHub Repository (facebook/openzl)
- InfoQ: Meta Open Sources OpenZL: a Universal Compression Framework
- Conceptual Overview of OpenZL (openzl.org)