Loading a CSV into a DataFrame with a datetime as the index:
You can, by passing the index_col keyword (the position of the column, as an int, that should be used as the index) together with parse_dates=True. Example:
# Pretend the dates are in the 2nd column
df = pd.read_csv('bunch_of_dated_data.csv', index_col=1, parse_dates=True)
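The same idea works with a column name instead of a position. A minimal sketch with a made-up two-row CSV (the file contents and column names here are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV where the date column is named "Date"
csv_data = io.StringIO(
    "id,Date,price\n"
    "1,2012-06-11,1600.20\n"
    "2,2012-06-12,1610.02\n"
)

# index_col also accepts a column *name*; parse_dates=True then
# converts that index into a DatetimeIndex
df = pd.read_csv(csv_data, index_col="Date", parse_dates=True)
print(type(df.index).__name__)  # DatetimeIndex
```

With a DatetimeIndex in place you can then slice by date strings, e.g. `df.loc["2012-06-11"]`.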
pandas.read_csv to the rescue:
import pandas as pd
df = pd.read_csv("data.csv")
print(df)
This outputs a pandas DataFrame:
         Date    price  factor_1  factor_2
0  2012-06-11  1600.20     1.255     1.548
1  2012-06-12  1610.02     1.258     1.554
2  2012-06-13  1618.07     1.249     1.552
3  2012-06-14  1624.40     1.253     1.556
4  2012-06-15  1626.15     1.258     1.552
5  2012-06-16  1626.15     1.263     1.558
6  2012-06-17  1626.15     1.264     1.572
To read a CSV file as a pandas DataFrame, you'll need to use pd.read_csv, which has sep=',' as the default.
But this isn't where the story ends; data exists in many different formats and is stored in different ways, so you will often need to pass additional parameters to read_csv to ensure your data is read in properly.
Here's a table listing common scenarios encountered with CSV files, along with the argument you'll need in each case. You will usually need some combination of these to read in your data.
┌──────────────────────────────────────────────────────┬───────────────────────┬────────────────────────────────────────────────────┐
│ pandas Implementation                                │ Argument              │ Description                                        │
├──────────────────────────────────────────────────────┼───────────────────────┼────────────────────────────────────────────────────┤
│ pd.read_csv(..., sep=';')                            │ sep/delimiter         │ Read CSV with different separator¹                 │
│ pd.read_csv(..., delim_whitespace=True)              │ delim_whitespace      │ Read CSV with tab/whitespace separator             │
│ pd.read_csv(..., encoding='latin-1')                 │ encoding              │ Fix UnicodeDecodeError while reading²              │
│ pd.read_csv(..., header=None, names=['x', 'y', 'z']) │ header and names      │ Read CSV without headers³                          │
│ pd.read_csv(..., index_col=[0])                      │ index_col             │ Specify which column to set as the index⁴          │
│ pd.read_csv(..., usecols=['x', 'y'])                 │ usecols               │ Read subset of columns                             │
│ pd.read_csv(..., thousands='.', decimal=',')         │ thousands and decimal │ Numeric data is in European format (e.g. 1.234,56) │
└──────────────────────────────────────────────────────┴───────────────────────┴────────────────────────────────────────────────────┘
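As an illustration of combining several of these arguments, here is a small sketch (the CSV content is made up) that reads a semicolon-separated file with European-style numbers:

```python
import io
import pandas as pd

# Hypothetical European-style CSV: ';' separator,
# '.' as thousands grouping, ',' as decimal point
csv_data = io.StringIO(
    "item;amount\n"
    "a;1.234,56\n"
    "b;7.890,12\n"
)

# sep handles the separator; thousands/decimal fix the number format
df = pd.read_csv(csv_data, sep=";", thousands=".", decimal=",")
print(df["amount"].tolist())  # [1234.56, 7890.12]
```

Without thousands/decimal, the amount column would be read as strings rather than floats.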
Footnotes
1. By default, read_csv uses a C parser engine for performance. The C parser can only handle single-character separators. If your CSV has a multi-character separator, you will need to use the 'python' engine, which also accepts regular expressions: df = pd.read_csv(..., sep=r'\s*\|\s*', engine='python')
2. UnicodeDecodeError occurs when the data was stored in one encoding but read in a different, incompatible one. The most common encoding schemes are 'utf-8' and 'latin-1', and your data is likely to fit one of these.
3. header=None specifies that the first row in the CSV is a data row rather than a header row, and names=[...] lets you specify a list of column names to assign to the DataFrame when it is created.
4. "Unnamed: 0" occurs when a DataFrame with an unnamed index is saved to CSV and then re-read. Instead of having to fix the issue while reading, you can also fix it when writing by using df.to_csv(..., index=False)
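A quick sketch of the "Unnamed: 0" round trip described above, using a throwaway DataFrame:

```python
import io
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Round-trip WITHOUT index=False: the unnamed index is written as a
# blank-headered first column, which comes back as "Unnamed: 0"
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(with_index.columns.tolist())  # ['Unnamed: 0', 'x', 'y']

# Round-trip WITH index=False: the index is never written out
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(without_index.columns.tolist())  # ['x', 'y']
```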
There are other arguments I've not mentioned here, but these are the ones you'll encounter most frequently.
How to best import a CSV file into pandas which is really 3 DataFrames in one CSV?
I work with some CSV files that I get from an outside entity (i.e. there's no way to actually change the CSV files themselves) that contain data I'd like to process more efficiently. However, the files are structured so that each one is really 3 datasets in one file. For example, the first dataset has 46 columns, the second has 27, and the third has 15. The datatypes for each column don't match across the datasets either.
Thus, I'm trying to figure out the most efficient and clean way to import these files and split them into their respective datasets. The best way I've found so far is to import the entire file into one big dataset, then subset it into 3 by searching for substrings in the first column that indicate which dataset each row belongs to. I then rename the columns and cast them to their appropriate datatypes. However, this feels like I'm essentially importing the file twice, and it doesn't seem very clean.
I was wondering if there was a way for pandas to import the file and only read in rows that have a certain number of elements or something, or if you know of another, more efficient way to read in a file like the one I mentioned, it would be much appreciated!
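One way to avoid parsing the whole file twice is to split the raw lines into per-dataset buffers in a single pass, then hand each buffer to read_csv separately, so each dataset can get its own column names and dtypes. A minimal sketch, with made-up markers ("A", "B", "C") standing in for the real first-column substrings:

```python
import io
import pandas as pd

# Made-up stand-in for the real file: three stacked datasets, where each
# record's first field identifies which dataset it belongs to
raw = (
    "A,1,10\n"
    "A,2,20\n"
    "B,x\n"
    "B,y\n"
    "C,5,6,7\n"
)

# Single pass over the raw lines: route each line into the buffer
# for its dataset based on the first field
buffers = {"A": [], "B": [], "C": []}
for line in raw.splitlines():
    marker = line.split(",", 1)[0]
    buffers[marker].append(line)

# Parse each buffer independently; this is where you would pass
# per-dataset names=... and dtype=... arguments
frames = {
    key: pd.read_csv(io.StringIO("\n".join(lines)), header=None)
    for key, lines in buffers.items()
}
print(frames["A"].shape)  # (2, 3)
```

For a real file you would read the lines with open() instead of a string, and match on the actual substrings; the point is that the text is only scanned once before pandas parses each piece.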