Read a small random sample from a big CSV file into a Pandas data frame

stackoverflow.com › questions › 22258491 › read-a-small-random-sample-from-a-big-csv-file-into-a-pandas-data-frame

Assuming no header in the CSV file:

import pandas
import random

n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n), n - s))
df = pandas.read_csv(filename, skiprows=skip, header=None) #header=None so the first kept row is read as data, not as column names

It would be better if read_csv had a keeprows parameter, or if skiprows took a callback function instead of a list.

With header and unknown file length:

import pandas
import random

filename = "data.txt"
with open(filename) as f:
    n = sum(1 for line in f) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

Answer from dlm on Stack Overflow

pandas.pydata.org › docs › reference › api › pandas.read_csv.html

pandas.read_csv — pandas 3.0.1 documentation

Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python standard encodings. encoding_errors : str, optional, default 'strict' · How encoding errors are treated. List of possible values. ... If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect documentation for more details.

datacamp.com › tutorial › pandas-read-csv

pandas read_csv() Tutorial: Importing Data | DataCamp

December 23, 2025 - For example, pd.read_csv('file.csv', comment='#'). Use the header parameter to specify which row to use as the column names. If there are multiple header rows, you can also use the names parameter to assign new column names. If the file structure is complex, you might need to pre-process the ...

w3schools.com › python › pandas › pandas_csv.asp

Pandas Read CSV

CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas. In our examples we will be using a CSV file called 'data.csv'.

medium.com › analytics-vidhya › make-the-most-out-of-your-pandas-read-csv-1531c71893b5

Make the Most Out of your pandas.read_csv() | by Melissa Rodriguez | Analytics Vidhya | Medium

December 17, 2019 - Here is the csv file and code I tried first to import the fertility rate data used for my previous blogs: ... For my analysis I want to use all columns except the ones named Indicator Name and Indicator Code. Also the column for 2018 year is empty so I do not need it as well.

#import pandas library
import pandas as pd
#import fertility rate data
df = pd.read_csv('data/API_SP.DYN.TFRT.IN_DS2_en_csv_v2_41035.csv', skiprows = 4)
#remove unnecessary columns:
df = df.drop(columns = ['Indicator Name','Indicator Code','Unnamed: 63','2018'])
#renaming columns
df.rename(columns={'Country Name':'CountryName', 'Country Code':'CountryCode3'}, inplace=True)
df.head()

stackoverflow.com › questions › 22258491 › read-a-small-random-sample-from-a-big-csv-file-into-a-pandas-data-frame

python - Read a small random sample from a big CSV file into a Pandas data frame - Stack Overflow

@dlm's answer is great but since v0.20.0, skiprows does accept a callable. The callable receives as an argument the row number.

Note also that their answer for unknown file length relies on iterating through the file twice -- once to get the length, and then another time to read the csv. I have three solutions here which only rely on iterating through the file once, though they all have tradeoffs.
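
As a rough sketch of the callable form of skiprows mentioned above, the snippet below draws an exact-size sample by choosing which row numbers to keep rather than which to skip. It assumes the number of data rows is already known, and the helper name sample_exact and the in-memory CSV are made up for illustration.

```python
import random
from io import StringIO

import pandas as pd


def sample_exact(csv_text, n_rows, sample_size, seed=None):
    """Read exactly sample_size randomly chosen data rows (hypothetical helper)."""
    rng = random.Random(seed)
    # Row 0 is the header, so data rows are numbered 1..n_rows.
    keep = set(rng.sample(range(1, n_rows + 1), sample_size))
    return pd.read_csv(
        StringIO(csv_text),
        header=0,
        # The callable gets the row number; returning True skips that row.
        skiprows=lambda i: i > 0 and i not in keep,
    )


# Small in-memory file standing in for a large CSV on disk.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1, 1001))
df = sample_exact(csv_text, n_rows=1000, sample_size=50, seed=0)
print(len(df))
```

Holding only the kept row numbers in a set means memory scales with the sample size rather than with the file length, which may matter when the sample is much smaller than the file.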

Solution 1: Approximate Percentage

If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:

import pandas as pd
import random

filename = "data.txt"
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p,
)

As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired use case.
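
To get a feel for how approximate the count is, here is a small sketch on an in-memory CSV (the data and the seed are made up). Each data row is kept independently with probability p, so the number of rows kept should land near p times the row count, with a spread of roughly sqrt(n * p * (1 - p)).

```python
import random
from io import StringIO

import pandas as pd

random.seed(42)  # fixed seed only so the illustration is repeatable
p = 0.1          # keep roughly 10% of the data rows
n_rows = 10_000

# One header line followed by n_rows data lines.
csv_text = "id\n" + "".join(f"{i}\n" for i in range(n_rows))

df = pd.read_csv(
    StringIO(csv_text),
    header=0,
    # Row 0 (the header) is always kept; other rows are skipped with probability 1 - p.
    skiprows=lambda i: i > 0 and random.random() > p,
)
print(len(df), "rows kept; expected about", int(p * n_rows))
```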

Solution 2: Every Nth line

This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, this may meet your needs.

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
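
A quick check on a made-up in-memory file shows which rows this keeps. Row 0 is the header and passes the test because 0 % n == 0, so the first data row kept is line n, not line 1.

```python
from io import StringIO

import pandas as pd

# Header on line 0, then data lines 1..1000 whose id equals their line number.
csv_text = "id\n" + "".join(f"{i}\n" for i in range(1, 1001))

n = 100  # keep every 100th line
df = pd.read_csv(StringIO(csv_text), header=0, skiprows=lambda i: i % n != 0)

# The kept ids are the multiples of n: 100, 200, ..., 1000.
print(df["id"].tolist())
```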

Solution 3: Reservoir Sampling

(Added July 2021)

Reservoir sampling is an elegant algorithm for selecting k items randomly from a stream whose length is unknown, but that you only see once.

The big advantage is that you can use this without having the full dataset on disk, and that it gives you an exactly-sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas; I think you need to drop into Python to read the file and then construct the dataframe afterwards. So you may lose some functionality from read_csv or need to reimplement it, since we're not using pandas to actually read the file.

Taking an implementation of the algorithm from Oscar Benjamin here:

from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO

import pandas as pd

def reservoir_sample(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items

    from https://bugs.python.org/issue41311#msg373733
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

def sample_file(filepath, k):
    with open(filepath, 'r') as f:
        header = next(f)
        # sample the remaining lines in a single pass over the file
        result = [header] + reservoir_sample(f, k)
    return pd.read_csv(StringIO(''.join(result)))

The reservoir_sample function returns a list of strings, each of which is a single row, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header row; I haven't thought about how to extend it to other situations.

I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv (January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take ~3-4 seconds.

In my test this is even slightly (~10-20%) faster than @Bar's answer using shuf, which surprises me.
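
For readers who find the skip-based version above hard to follow, here is a sketch of the simpler classic variant, often called Algorithm R. It is not the implementation quoted above and it draws one random number per line, so it is likely slower on large files. The function names and the in-memory data are made up for illustration.

```python
import random
from io import StringIO


def reservoir_sample_simple(iterable, k, rng=None):
    """Pick k items uniformly at random in one pass (classic Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number index + 1 replaces a random slot with probability k / (index + 1).
            j = rng.randrange(index + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_csv_text(file_obj, k, rng=None):
    """Return the header line plus k sampled data lines as one string."""
    header = next(file_obj)
    return header + "".join(reservoir_sample_simple(file_obj, k, rng))


# In-memory stand-in for a large file: one header line and 100 data lines.
source = StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(100)))
sampled = sample_csv_text(source, k=5, rng=random.Random(1))
print(sampled)
```

The resulting string can be handed to pd.read_csv(StringIO(sampled)) in the same way as in sample_file above.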

geeksforgeeks.org › pandas › python-read-csv-using-pandas-read_csv

Pandas Read CSV in Python - GeeksforGeeks

In this example, we will take a CSV file and then add some special characters to see how the sep parameter works. ...

import pandas as pd

data = """totalbill_tip, sex:smoker, day_time, size
16.99, 1.01:Female|No, Sun, Dinner, 2
10.34, 1.66, Male, No|Sun:Dinner, 3
21.01:3.5_Male, No:Sun, Dinner, 3
23.68, 3.31, Male|No, Sun_Dinner, 2
24.59:3.61, Female_No, Sun, Dinner, 4
25.29, 4.71|Male, No:Sun, Dinner, 4"""

with open("sample.csv", "w") as file:
    file.write(data)
print(data)

Published February 18, 2026

askpython.com › home › how to read csv with headers using pandas?

How to Read CSV with Headers Using Pandas? - AskPython

January 21, 2026 - When you call pd.read_csv(), Pandas scans the first row of your CSV file and treats it as column names. This behavior is controlled by the header parameter, whose default of 'infer' behaves like header=0 when no column names are passed.
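
A small sketch of how the header parameter changes the result, using a made-up headerless CSV held in memory. With the default, the first data row is consumed as column names; with header=None every row is kept as data.

```python
from io import StringIO

import pandas as pd

csv_text = "1,2\n3,4\n5,6\n"  # three data rows, no header line

# Default behaviour: the first line becomes the column names.
with_default = pd.read_csv(StringIO(csv_text))

# header=None: all three lines are data and columns are numbered 0, 1.
without_header = pd.read_csv(StringIO(csv_text), header=None)

# header=None plus names: all lines are data, with explicit column names.
named = pd.read_csv(StringIO(csv_text), header=None, names=["a", "b"])

print(with_default.shape, without_header.shape, list(named.columns))
```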

pyimagesearch.com › home › blog › read csv file using pandas read_csv (pd.read_csv)

Read csv file using Pandas read_csv (pd.read_csv) - PyImageSearch

November 30, 2024 - Below is a simple example to demonstrate how to use the usecols parameter with pd.read_csv.

# Import the pandas library
import pandas as pd

# Assume we have a CSV file named 'large_video_game_sales.csv'
# We only need the columns 'Name', 'Platform', and 'Global_Sales'
# Load specific columns using the usecols parameter
specific_columns = pd.read_csv(
    './data/large_video_game_sales.csv',
    usecols=['Name', 'Platform', 'Global_Sales']
)

# Display the first few rows to verify the data
print(specific_columns.head())

sparkbyexamples.com › home › pandas › pandas read_csv() with examples

Pandas read_csv() with Examples - Spark By {Examples}

June 5, 2025 - In this article, I will explain the usage of some of these options with examples. To read a CSV file with comma delimiter use pandas.read_csv() and to read tab delimiter (\t) file use read_table().

stackoverflow.com › questions › 68389579 › how-to-read-csv-file-using-pandas-jupyter-notebooks

python - How to read CSV file using Pandas (Jupyter notebooks) - Stack Overflow

Just some explanation as an aside: before you can use pd.read_csv to import your data, you need to locate your data in your filesystem.

Assuming you use a Jupyter notebook or Python file and the CSV file is in the directory you are currently working in, you can just use:

import pandas as pd
SouthKoreaRoads_df = pd.read_csv('SouthKoreaRoads.csv')

If the file is located in another directory, you need to specify this directory. For example, if the CSV is in a subdirectory (relative to the Python file or Jupyter notebook you are working on), you need to add the directory's name. If it's in the folder "data", then add data in front of the file name, separated by a "/":

import pandas as pd
SouthKoreaRoads_df = pd.read_csv('data/SouthKoreaRoads.csv')

Pandas accepts any valid string path or URL, so you can also give a full path. On Windows, use a raw string so the backslashes are not treated as escape sequences:

import pandas as pd
SouthKoreaRoads_df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')

So far, no os module is needed. read_csv also accepts os.PathLike objects, but the os module is only needed if you want to build a path in a variable before accessing it, or if you do complex path handling, for example because the code needs to run in another environment such as a web app, where the path is relative and could change depending on how it is deployed.
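
As a sketch of passing a path-like object, the snippet below builds the path with pathlib instead of string concatenation. The temporary directory, the file contents and the column names are made up so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build data/SouthKoreaRoads.csv under a throwaway directory.
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    csv_path = data_dir / "SouthKoreaRoads.csv"
    csv_path.write_text("road,length_km\nA1,12.5\nB2,7.0\n")

    # read_csv accepts the Path object directly; no str() conversion is needed.
    df = pd.read_csv(csv_path)

print(df)
```

Joining with the / operator lets pathlib pick the right separator for the operating system, which avoids the backslash problem in hand-written Windows paths.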

please see also:

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html https://docs.python.org/3/library/os.path.html

BR

SouthKoreaRoads = pd.read_csv("./SouthKoreaRoads.csv")

Try this and see whether it could help!

programiz.com › python-programming › pandas › csv

Pandas CSV (With Examples)

In this example, we read a CSV file using the read_csv() method. We specified some arguments while reading the file to load the necessary data in appropriate format. ... We used read_csv() to read data from a CSV file into a DataFrame. Pandas also provides the to_csv() function to write data ...

fabi.ai › blog › how-to-read-a-csv-with-python-pandas-made-easy

How to read a CSV with Python pandas (made easy) | Fabi.ai

In this tutorial we explore using Python pandas pd.read_csv() to read a CSV file into a DataFrame. We also explore advanced parameters that are commonly used and useful along with some other ways to analyze CSV data using Python.

machinelearningplus.com › pandas › pandas-read_csv-completed

Pandas read_csv() - How to read a csv file in Python - MachineLearningPlus

March 8, 2022 - Syntax: pandas.read_csv(filepath_or_buffer, sep, header, index_col, usecols, prefix, dtype, converters, skiprows, nrows, na_values, parse_dates) Purpose: Read a comma-separated values (csv) file into DataFrame.

reddit.com › r/python › i wrote a detailed guide of how pandas' read_csv() function actually works and the different engine options available, including new features in v2.0. figured it might be of interest here!

r/Python on Reddit: I wrote a detailed guide of how Pandas' read_csv() function actually works and the different engine options available, including new features in v2.0. Figured it might be of interest here!

March 30, 2023 - I don't know why you would expect a function called read_csv to be simple and parsimonious though. CSV is not a standardized file format; there are probably just as many variations on it as there are CSV files. I'm not saying pandas has the best API, but of course complex things have complex solutions.

pandas.pydata.org › pandas-docs › version › 2.0 › reference › api › pandas.read_csv.html

pandas.read_csv — pandas 2.0.3 documentation

Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g.

pandas.pydata.org › docs › getting_started › intro_tutorials › 02_read_write.html

How do I read and write tabular data? — pandas 3.0.1 documentation

I want to analyze the Titanic passenger data, available as a CSV file. ... pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame.

pandas.pydata.org › pandas-docs › version › 0.19.0 › generated › pandas.read_csv.html

pandas.read_csv — pandas 0.19.0 documentation

pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, ...

pythonbasics.org › read-csv-with-pandas

Read CSV with Pandas - Python Tutorial

To read the csv file as pandas.DataFrame, use the pandas function read_csv() or read_table().