Reading a partitioned Delta table with Polars

I recently had occasion to revisit my findings on the best way to read partitioned Delta Lake data from Python. We use Polars for data processing, so an approach that integrates well with Polars is the most convenient. Polars has a scan_delta method, but it used to be suboptimal for reading partitioned data. The library has progressed so much recently, though, that it was due for a retest.

I had two datasets, each partitioned into three roughly equal chunks: one very small (700 KB) and one somewhat bigger (79 MB). I wrote a script (or rather, had Cursor generate one) to measure the memory utilization and running time of three possible approaches: a LazyFrame with scan_delta (the most convenient), a DataFrame with read_delta, or the DeltaTable API directly, with PyArrow options to specify which partition to read.
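For concreteness, the three approaches look roughly like this. This is a minimal sketch, not the benchmark script itself: the table path and the `date`/`category` column names are illustrative, and it assumes the table is partitioned by `date`.

```python
import polars as pl
from deltalake import DeltaTable

TABLE_PATH = "data/events"  # hypothetical table, partitioned by `date`

# 1. LazyFrame + filter: the partition predicate can be pushed down into
#    the scan, so only the matching partition files are read.
lazy_result = (
    pl.scan_delta(TABLE_PATH)
    .filter(pl.col("date") == "2024-01-01")
    .group_by("category")
    .agg(pl.len())
    .collect()
)

# 2. DataFrame + filter: read_delta materializes the whole table first,
#    then filters in memory.
eager_result = (
    pl.read_delta(TABLE_PATH)
    .filter(pl.col("date") == "2024-01-01")
    .group_by("category")
    .agg(pl.len())
)

# 3. DeltaTable API: ask deltalake for a single partition via PyArrow-style
#    partition filters, then hand the resulting Arrow table to Polars.
dt = DeltaTable(TABLE_PATH)
arrow_table = dt.to_pyarrow_table(partitions=[("date", "=", "2024-01-01")])
delta_result = pl.from_arrow(arrow_table).group_by("category").agg(pl.len())
```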

I was pleasantly surprised to see that the LazyFrame + filter option, which is the most natural and convenient to use, is also the fastest and the most memory efficient, most likely because the partition predicate is pushed down into the scan, so the full table is never materialized in memory. The script gave me this cute results table:

📈 FILTER + AGGREGATION STATISTICAL ANALYSIS

Approach                          Time (s)        Memory (MB)    Records
1. LazyFrame + Filter + GroupBy   0.282 ± 0.033    26.8 ± 0.6    1330378 ± 0
2. DataFrame + Filter + GroupBy   0.653 ± 0.134   967.9 ± 1.8    1330378 ± 0
3. DeltaTable + Filter + GroupBy  0.417 ± 0.010   696.5 ± 8.6    1330378 ± 0

🏆 BEST PERFORMERS (Based on Averages):

  • Fastest: 1. LazyFrame + Filter + GroupBy
    Average time: 0.282s ± 0.033s
  • 💾 Most Memory Efficient: 1. LazyFrame + Filter + GroupBy
    Average memory: 26.8MB ± 0.6MB
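The numbers above are means ± standard deviations over repeated runs. As a rough illustration of how such a harness can measure this (the actual script may do it differently, and RSS deltas are only a crude proxy for peak memory), here is a sketch using psutil:

```python
import statistics
import time

import psutil


def benchmark(fn, runs: int = 5) -> str:
    """Call `fn` repeatedly; report mean ± stdev of wall time and RSS growth."""
    proc = psutil.Process()
    times, mems = [], []
    for _ in range(runs):
        rss_before = proc.memory_info().rss
        start = time.perf_counter()
        fn()  # the approach under test, e.g. the LazyFrame query above
        times.append(time.perf_counter() - start)
        mems.append((proc.memory_info().rss - rss_before) / 1e6)  # bytes -> MB

    def fmt(xs):
        return f"{statistics.mean(xs):.3f} ± {statistics.stdev(xs):.3f}"

    return f"time {fmt(times)} s, memory {fmt(mems)} MB"
```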

Find the code on GitHub.
