
From JVM to Native: Building High-Performance RAG Document Extraction with GraalVM

Nadjib Mammeri

Nov 29, 2024

Our take on document extraction in RAG pipelines: a high-performance solution combining Rust, Apache Tika, and GraalVM that achieves up to 25x faster processing, significantly lower memory usage, and zero JVM overhead.

Intro

Every great engineering solution starts with a challenge. For us, it was the frustration of dealing with slow and memory-hungry document extraction tools in our RAG (Retrieval-Augmented Generation) pipeline. Traditional Python-based solutions, like unstructured-io, were simply not cutting it. They crawled under heavy loads, consumed excessive memory, and failed to efficiently utilize multiple CPU cores. This bottleneck led us to ask: Do we really need to rely on external services or complex Python frameworks just for content extraction? Couldn't this fundamental task be performed locally and efficiently?

This led us to create Extractous, taking a radically different approach:

  1. Built the core in Rust for high performance and memory safety

  2. Leveraged Apache Tika's battle-tested parsing capabilities

  3. Used GraalVM's native compilation to eliminate the JVM overhead

  4. Created a clean, focused API for one purpose: blazing-fast document extraction

The results? A 25x performance improvement over unstructured-io, with significantly lower memory usage. In this post, we'll dive into how we achieved this by combining GraalVM's native compilation capabilities with Rust's performance characteristics, creating a solution that's both powerful and practical for modern data processing pipelines.

Technical Architecture

To tackle the challenge of high-performance document extraction, we designed an architecture that leverages the strengths of GraalVM, Rust, and Apache Tika. This combination delivers unprecedented performance for document processing, transforming Tika from a JVM-based library into a native shared library that integrates seamlessly with our Rust core.

[Rust Core]  ←→  [GraalVM Native Library]  ←→  [Apache Tika]

Performance           Native Compilation        Document Parsing
Memory Safety         Zero JVM Overhead         Format Support
Multi-threading       Shared Library            Content Extraction

The genius of this architecture lies in how GraalVM bridges two powerful worlds: Tika's mature document parsing capabilities and Rust's performance characteristics. GraalVM's native-image compiler transforms Tika from a JVM-based library into a native shared library, enabling direct integration with our Rust core without any JVM overhead.

The Power of Native Integration

GraalVM's native compilation is the cornerstone of our architecture. It allows us to eliminate JVM startup time and runtime overhead, enabling direct memory access between Rust and Tika. This results in near-native performance for Java code, while still maintaining Tika's robust parsing capabilities. The transformation is so seamless that Tika, once a JVM-based library, now runs as native code with minimal overhead.
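What "direct integration" means in practice is an ordinary C FFI boundary: Rust strings are converted to C strings on the way in, and the native library's output is reclaimed on the way out. The sketch below illustrates that boundary with a pure-Rust stub standing in for the GraalVM-built library; the function name `tika_extract_stub` and the wrapper are illustrative, not Extractous's actual bindings.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// Stand-in for a symbol exported by the GraalVM-built shared library.
// The real integration resolves an `extern "C"` function at link time;
// it is stubbed here in pure Rust so the sketch runs on its own.
unsafe extern "C" fn tika_extract_stub(path: *const c_char) -> *mut c_char {
    unsafe {
        let path = CStr::from_ptr(path).to_string_lossy();
        CString::new(format!("text of {path}")).unwrap().into_raw()
    }
}

/// Safe wrapper: Rust string in, Rust string out, with the C-string
/// conversions happening only at the boundary.
fn extract(path: &str) -> String {
    let c_path = CString::new(path).expect("path must not contain NUL");
    unsafe {
        let raw = tika_extract_stub(c_path.as_ptr());
        // Reclaim ownership so the returned buffer is freed on this side.
        CString::from_raw(raw).into_string().unwrap()
    }
}
```

Because the native library is an in-process shared object rather than a server, each call is just a function call across this boundary, with no sockets or serialization involved.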

This has a huge impact on our extraction pipeline.

Imagine processing a 50MB PDF file. With traditional tools like unstructured-io, you'd be looking at a memory peak of around 800MB and a processing time of 2.5 seconds. With Extractous, powered by GraalVM, the memory peak drops to just 70MB, and the processing time shrinks to a mere 0.1 seconds. This breakthrough transformed our RAG data extraction pipeline.
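As a quick sanity check on those figures, the improvement factors are just ratios of the before and after measurements:

```rust
/// Ratio between a baseline measurement and an improved one.
fn improvement(baseline: f64, improved: f64) -> f64 {
    baseline / improved
}

// 50 MB PDF example from the text above:
//   processing time: 2.5 s  -> 0.1 s  => 25x faster
//   peak memory:     800 MB -> 70 MB  => ~11.4x less memory
```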

Optimizing Native Compilation

Our approach to native compilation starts with a relentless focus on performance. We utilize GraalVM's maximum optimization level along with parallel compilation across all available CPU cores. This is combined with pure ahead-of-time compilation, ensuring there are no runtime fallbacks that could impact performance. We also carefully manage resource initialization at build time, eliminating runtime overhead.

Cross-platform reliability was another crucial consideration in our native compilation strategy. We compile targeting compatibility across different CPU architectures while generating optimized shared libraries for each platform. This ensures that Extractous runs seamlessly whether you're on Linux, macOS, or Windows, while still maintaining peak performance characteristics.

To handle real-world document processing needs, we enabled essential functionality like comprehensive character encoding support for international documents and built-in XHTML capabilities. The native library is optimized for headless operation in server environments, making it perfect for high-throughput document processing pipelines.

Beyond Simple Compilation

GraalVM's native-image technology does much more than just compile Java code. It performs sophisticated optimizations including aggressive dead code elimination and static analysis for optimal memory layout. The ahead-of-time optimization of common code paths ensures peak performance in real-world usage, while direct memory access without JVM heap management keeps memory usage low and predictable.

This powerful combination of technologies results in a library that starts instantly, runs faster, and uses significantly less memory than traditional Java applications, while maintaining all of Tika's powerful document processing capabilities. The end result is a solution that brings true native performance to Java-based document processing.

Build System Integration

Building Extractous is a breeze thanks to our seamless integration between GraalVM native compilation and Rust's Cargo build system. At the heart of this integration is an automated pipeline that performs the transformation described above. This means developers can focus on building features, not managing complex build configurations.

The build process orchestrates the compilation of Tika into native code using GraalVM, carefully managing aspects like character encoding support, XHTML capabilities, and cross-platform compatibility. By using GraalVM's ahead-of-time compilation with specific optimizations like -O3 and parallel compilation, we ensure the resulting native library maintains high performance while being portable across different platforms.
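The options involved can be pinned down in the build script itself. Below is a sketch of assembling that invocation; the flags shown are standard native-image options matching the ones discussed in this post, but the exact set Extractous passes is not reproduced here and should be treated as an assumption.

```rust
/// native-image arguments corresponding to the options discussed above.
/// These are standard native-image flags; the exact set Extractous uses
/// may differ.
fn native_image_args(parallelism: usize) -> Vec<String> {
    vec![
        "-O3".into(),                      // maximum optimization level
        "--no-fallback".into(),            // pure AOT, no JVM fallback image
        "--shared".into(),                 // produce a shared library
        "-march=compatibility".into(),     // portable across CPU generations
        format!("--parallelism={parallelism}"), // parallel compilation
        "-H:+AddAllCharsets".into(),       // international character encodings
        "-Djava.awt.headless=true".into(), // headless server operation
    ]
}
```

A Cargo build script can then hand this list to `std::process::Command::new("native-image")`, which is how the Rust and GraalVM toolchains end up driven by a single `cargo build`.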

One of the key innovations in the build process is how we've automated the entire GraalVM workflow, from installation to native compilation. Developers don't need to manually manage GraalVM versions or worry about native-image configurations - the build system handles everything automatically, ensuring consistent results whether building on Linux, macOS, or Windows.

The end result is a native library that brings Tika's powerful document parsing capabilities to Rust with near-native performance, zero JVM overhead, and minimal startup time - all while maintaining the simplicity of a standard Cargo build process.

Performance Analysis

When we set out to build Extractous, our goal was to significantly improve document extraction performance. The results have exceeded our expectations, demonstrating up to a 25x speedup over unstructured-io.

Benchmark Results

Curious to see Extractous in action? Check out our benchmarking repository where we process large batches of SEC-10K filings, a common use case in financial analysis RAG applications. You'll see firsthand how Extractous outperforms traditional tools and why it's the go-to solution for high-performance document extraction.

Processing Speed: On average, Extractous processes documents approximately 18 times faster than unstructured-io.

Memory Usage and Startup Time: Consumes roughly 11 times less memory during operation and has near-instant initialization compared to Python-based solutions.

The performance gains come from our three-pronged approach:

First, by using GraalVM's native compilation, we eliminated the JVM overhead while keeping Tika's robust parsing capabilities.

Second, Rust's zero-cost abstractions and efficient memory management allow us to handle document processing with minimal overhead.

Third, the direct integration between Rust and the native-compiled Tika library eliminates costly interprocess communication and serialization overhead that's common in server-based solutions.

Lessons Learned

Throughout the development of Extractous, we've gathered valuable insights about combining GraalVM native compilation with Rust. Here are the key lessons that could benefit others undertaking similar projects.

Native Compilation Insights

The path to efficient GraalVM native compilation wasn't straightforward. We learned that careful configuration is crucial. Setting -march=compatibility proved essential for creating portable binaries that work across different CPU architectures. Without it, libraries built on newer CPUs would fail on older hardware.

Managing Tika's dependencies required careful consideration. We found that using a minimal logging setup (slf4j-nop) and routing all logging through a single channel significantly reduced complexity in the native build. This also helped avoid common GraalVM compilation issues related to dynamic class loading and reflection.

Cross-Platform Challenges

Building for multiple platforms revealed several key insights. Our initial attempts at cross-compilation often failed due to platform-specific assumptions. We learned to handle each platform's peculiarities explicitly:

  • Windows requires special handling for path separators and build commands

  • macOS ARM64 needs specific GraalVM builds (we use Liberica NIK)

  • Linux builds need to account for different libc versions
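Much of this platform handling boils down to small, explicit dispatch points in the build script. As a minimal std-only sketch (the base name `tika_native` is illustrative, not necessarily the library's real name), picking the right shared-library filename per target looks like:

```rust
/// Platform-specific filename for the native Tika library.
/// The base name `tika_native` is an illustrative assumption.
fn shared_lib_name(target_os: &str) -> String {
    match target_os {
        "windows" => "tika_native.dll".to_string(),
        "macos" => "libtika_native.dylib".to_string(),
        _ => "libtika_native.so".to_string(), // Linux and other Unix-likes
    }
}
```

Keeping these decisions in one pure function makes each platform's behavior easy to see and to test, rather than scattering `cfg` checks through the build logic.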

Memory Management Considerations

Working with large documents taught us valuable lessons about memory handling. The interaction between Rust's memory management and native libraries required careful attention to resource cleanup. We implemented proper drop handlers and explicit memory freeing to prevent leaks, especially when processing large batches of documents.
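The pattern we rely on is standard RAII: wrap the native handle in a struct whose `Drop` implementation releases it, so cleanup happens exactly once even across early returns and errors. The sketch below is self-contained, with the "native" buffer simulated by a `CString` allocation; in the real bindings, `Drop` would call the library's own free function instead.

```rust
use std::borrow::Cow;
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Owns a string buffer handed over by the native library and
/// releases it exactly once when the wrapper goes out of scope.
struct NativeText {
    ptr: *mut c_char,
}

impl NativeText {
    fn as_str(&self) -> Cow<'_, str> {
        unsafe { CStr::from_ptr(self.ptr).to_string_lossy() }
    }
}

impl Drop for NativeText {
    fn drop(&mut self) {
        if !self.ptr.is_null() {
            // Simulated here: the buffer came from CString::into_raw,
            // so reclaiming it with from_raw frees it. Real bindings
            // would call the native library's free function.
            unsafe { drop(CString::from_raw(self.ptr)) };
            self.ptr = std::ptr::null_mut();
        }
    }
}

// Simulates the native side returning an owned, NUL-terminated buffer.
fn fake_native_extract() -> NativeText {
    NativeText { ptr: CString::new("extracted text").unwrap().into_raw() }
}
```

With this in place, a batch loop can create and drop thousands of `NativeText` values without leaking, which matters when processing large document batches.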

Build System Optimization

Our iterative improvements to the build system revealed that investing in good development experience pays off. Automatic GraalVM installation and caching of built artifacts significantly reduced friction for new developers. It also taught us the importance of clear error messages when builds fail - they save hours of debugging time.

Leverage AI
without leaking your data

Schedule a demo now to see how our cutting-edge RAG solution can empower your team, enhance decision-making, and drive innovation across your enterprise.

Yobix © 2025
