
CSV vs. Parquet vs. Arrow: Storage Formats Explained


Image by Author

 

Introduction

Hugging Face Datasets provides one of the simplest ways to load datasets, using a single line of code. These datasets are frequently available in formats such as CSV, Parquet, and Arrow. While all three are designed to store tabular data, they operate differently on the backend. The choice of format determines how data is stored, how quickly it can be loaded, how much storage space is required, and how well data types are preserved. These differences become increasingly important as datasets grow larger and models more complex. In this article, we'll look at how Hugging Face Datasets works with CSV, Parquet, and Arrow, what actually makes them different on disk and in memory, and when each makes sense to use. So, let's get started.
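As a quick sketch of that single line (using the public "imdb" dataset as an illustrative example), one call downloads and prepares a dataset, whatever its underlying format:

from datasets import load_dataset

# One line downloads and prepares the dataset;
# behind the scenes it is converted to Arrow
ds = load_dataset("imdb", split="train")
print(ds)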

 

1. CSV

CSV stands for Comma-Separated Values. It's just text: one row per line, columns separated by commas (or tabs). Almost every tool can open it, e.g. Excel, Google Sheets, pandas, databases, and so on. It's very simple and interoperable.

Example:
name,age,city
Kanwal,30,New York
Qasim,25,Edmonton

 

Hugging Face treats it as a row-based format, meaning it reads data row by row. While this is acceptable for small datasets, performance deteriorates at scale. There are also some other limitations, such as:

  • No explicit schema: Since all data is stored as text, types have to be inferred every time the file is loaded. This can cause errors if the data is not consistent. One workaround is to declare the types yourself, as in the sketch after this list.
  • Large size and slow I/O: Text storage increases the file size, and parsing numbers from text is CPU-intensive.
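Here is a minimal sketch of declaring an explicit schema when loading a CSV, assuming a file named people.csv with the columns from the example above:

from datasets import load_dataset, Features, Value

# Declare the schema up front instead of relying on type inference
features = Features({
    "name": Value("string"),
    "age": Value("int32"),
    "city": Value("string"),
})
ds = load_dataset("csv", data_files="people.csv", features=features)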

 

2. Parquet

Parquet is a binary columnar format. Instead of writing rows one after another like CSV, Parquet groups values by column. That makes reads and queries much faster when you only need a few columns, and compression keeps file sizes and I/O low. Parquet also stores a schema, so types are preserved. It works best for batch processing and large-scale analytics, not for many small, frequent updates to the same file (it's better for batch writes than constant edits). If we take the above CSV example, Parquet will store all names together, all ages together, and all cities together. This is the columnar layout, and the example would look like this:

Names: Kanwal, Qasim
Ages: 30, 25
Cities: New York, Edmonton

 

It also adds metadata for each column: the type, min/max values, null counts, and compression info. This enables faster reads, efficient storage, and correct type handling. Compression algorithms like Snappy or Gzip further reduce disk space. It has the following strengths, illustrated in the sketch after the list:

  • Compression: Similar column values compress well. Files are smaller and cheaper to store.
  • Column-wise reading: Load only the columns you need, speeding up queries.
  • Rich typing: The schema is stored, so there's no guessing types on every load.
  • Scale: Works well for millions or billions of rows.
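A minimal PyArrow sketch of these strengths (the file name is illustrative): it writes the example table with Snappy compression, reads back a single column, and inspects the stored metadata.

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table matching the example above
table = pa.table({
    "name": ["Kanwal", "Qasim"],
    "age": [30, 25],
    "city": ["New York", "Edmonton"],
})

# Write with Snappy compression; per-column statistics are stored automatically
pq.write_table(table, "people.parquet", compression="snappy")

# Read back only the column you need
ages = pq.read_table("people.parquet", columns=["age"])

# Inspect the stored schema and per-column metadata
print(pq.ParquetFile("people.parquet").metadata)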

 

3. Arrow

Arrow is not the same kind of thing as CSV or Parquet. It's a columnar format held in memory for fast operations. In Hugging Face, every Dataset is backed by an Arrow table, whether you started from CSV, Parquet, or an Arrow file. Continuing with the same example table, Arrow also stores data column by column, but in memory:

Names: contiguous memory block storing Kanwal, Qasim
Ages: contiguous memory block storing 30, 25
Cities: contiguous memory block storing New York, Edmonton

 

Because data sits in contiguous blocks, operations on a column (like filtering, mapping, or summing) are extremely fast. Arrow also supports memory mapping, which allows datasets to be accessed from disk without fully loading them into RAM. Some of the key benefits of this format (see the sketch after this list) are:

  • Zero-copy reads: Memory-map files without loading everything into RAM.
  • Fast column access: The columnar layout enables vectorized operations.
  • Rich types: Handles nested data, lists, and tensors.
  • Interoperable: Works with pandas, PyArrow, Spark, Polars, and more.
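A minimal sketch of these benefits with Hugging Face Datasets (the directory name is illustrative): the dataset is Arrow-backed in memory, and reloading it from disk memory-maps the Arrow files instead of copying them into RAM.

from datasets import Dataset, load_from_disk

# Create an in-memory, Arrow-backed dataset
ds = Dataset.from_dict({
    "name": ["Kanwal", "Qasim"],
    "age": [30, 25],
    "city": ["New York", "Edmonton"],
})

# Save the Arrow files, then reload them; the reloaded dataset
# is memory-mapped rather than fully loaded into RAM
ds.save_to_disk("people_ds")
ds = load_from_disk("people_ds")

# Column access reads from contiguous Arrow buffers
print(ds["age"])  # [30, 25]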

 

Wrapping Up

 
Hugging Face Datasets makes switching formats routine. Use CSV for quick experiments, Parquet to store large tables, and Arrow for fast in-memory training. Knowing when to use each keeps your pipeline fast and simple, so you can spend more time on the model.
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
