June 21, 2020
(This article originally appeared, in a slightly different form, in Towards Data Science, Medium.)
In any data analysis project, we seek to discover useful, actionable insights from a given set of data. If we're lucky, this data may already be packaged for us; if not, we may need to collect it ourselves. Either way, once the data is stored, we will need to read it into a program to perform analysis.
The software package pandas is often used for this purpose. It is a powerful library that works with the Python programming language. Within pandas, the tool of choice to read in data files is the ubiquitous read_csv function.
In this article, we explore the basics of pandas' read_csv command: header options, specifying a sub-directory where applicable, using delimiters other than commas, identifying which column to use as the index, defining the types of fields, and handling missing values.
The Data
As with any pandas project, we first import pandas using the standard alias pd. Let's also import the numerical package NumPy using its standard alias np.
import pandas as pd
import numpy as np
For our example, let's read in stock price data. This data set lists the date of the trade, its open, close, high and low prices, the volume, and its ticker symbol.
What is a .csv file?
If we look at the data file in a text editor, the first few rows are as follows.
date,open,high,low,close,volume,Name
2013-02-08,15.07,15.12,14.63,14.75,8407500,AAL
2013-02-11,14.89,15.01,14.26,14.46,8882000,AAL
2013-02-12,14.45,14.51,14.1,14.27,8126000,AAL
2013-02-13,14.3,14.94,14.25,14.66,10259500,AAL
2013-02-14,14.94,14.96,13.16,13.99,31879900,AAL
We see that the first row consists of the column names, also known as the header.
Header options
The pandas command to read the data from a file and store it in a data frame called stocks_df is a simple one-liner.
stocks_df = pd.read_csv('stocks.csv')
To make sure this command worked as expected, let’s examine the first few rows of the data frame using head.
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
You might not be interested in all the columns in the .csv file. In this case, specify which columns you want to read into the data frame by using the usecols option. For instance, if you're only interested in the date, the volume, and the name of the stock, specify usecols=['date', 'volume', 'Name'].
stocks_df = pd.read_csv('stocks.csv', usecols=['date', 'volume', 'Name'])
stocks_df.head()
         date    volume Name
0  2013-02-08   8407500  AAL
1  2013-02-11   8882000  AAL
2  2013-02-12   8126000  AAL
3  2013-02-13  10259500  AAL
4  2013-02-14  31879900  AAL
As expected, only the columns we specified are included in the data frame.
By default, pandas assumes that the first row of your data is a header. Of course, not every data file has a header. For instance, suppose we have a data file without a header, as shown below.
2013-02-08,15.07,15.12,14.63,14.75,8407500,AAL
2013-02-11,14.89,15.01,14.26,14.46,8882000,AAL
2013-02-12,14.45,14.51,14.1,14.27,8126000,AAL
2013-02-13,14.3,14.94,14.25,14.66,10259500,AAL
2013-02-14,14.94,14.96,13.16,13.99,31879900,AAL
If we try to read the file in as before, we will not get the results we want.
stocks_df = pd.read_csv('stocks_no_header.csv')
stocks_df.head()
   2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
0  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
1  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
2  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
3  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
4  2013-02-15  13.93  14.61  13.93  14.50  15628000  AAL
pandas assumes the first line is a header, and reads it in as such. In this case, this isn't what we want, since that first row is data. Fortunately, the fix is easy: just specify the option header=None.
stocks_df = pd.read_csv('stocks_no_header.csv', header=None)
stocks_df.head()
            0      1      2      3      4         5    6
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Now pandas has numbered the columns starting at 0. If you prefer to have named columns instead, use the names option to specify your own column names.
col_names = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Ticker Symbol']
stocks_df = pd.read_csv('stocks_no_header.csv', header=None, names=col_names)
stocks_df.head()
         Date   Open   High    Low  Close    Volume Ticker Symbol
0  2013-02-08  15.07  15.12  14.63  14.75   8407500           AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000           AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000           AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500           AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900           AAL
Directories
In the example above, I saved the data file as stocks.csv in the same directory as my Python program. To keep things organized, it is quite common to save data files in a sub-directory. In this case, specify the sub-directory in the read_csv statement.
df = pd.read_csv('data/stocks.csv')
As before, let's check the first few rows to make sure all went well.
df.head()
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Delimiters
Not all data files separate values using commas. Tabs, spaces, or any other character may be used. For instance, our data file might have been set up with semicolons instead of commas.
date;open;high;low;close;volume;Name
2013-02-08;15.07;15.12;14.63;14.75;8407500;AAL
2013-02-11;14.89;15.01;14.26;14.46;8882000;AAL
2013-02-12;14.45;14.51;14.1;14.27;8126000;AAL
2013-02-13;14.3;14.94;14.25;14.66;10259500;AAL
2013-02-14;14.94;14.96;13.16;13.99;31879900;AAL
To specify a separator other than a comma, use the sep option.
stocks_df = pd.read_csv('stocks_semicolon.csv', sep=';')
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Specifying a column to be the index
If you don’t want the default index, you may specify a column to be used as the index of the data frame. For instance, to specify the index to be the date column, use index_col='date'.
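The call for this example might look like the following (a minimal sketch, assuming the same stocks.csv file used above).

# Use the 'date' column as the data frame's index
stocks_df = pd.read_csv('stocks.csv', index_col='date')
stocks_df.head()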
             open   high    low  close    volume Name
date
2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Now the date column is the index, as desired.
Types
Let's check the type of each column.
stocks_df = pd.read_csv('stocks.csv')
stocks_df.dtypes
date       object
open      float64
high      float64
low       float64
close     float64
volume      int64
Name       object
dtype: object
Let's suppose we want to store the numerical data as 32-bit types to save space. To do so, set the dtype option to a dictionary in which the keys are the column names and the values are the desired types.
types = {
    'open': np.float32,
    'high': np.float32,
    'low': np.float32,
    'close': np.float32,
    'volume': np.int32
}
stocks_df = pd.read_csv('stocks.csv', dtype=types)
stocks_df.dtypes
date       object
open      float32
high      float32
low       float32
close     float32
volume      int32
Name       object
dtype: object
The date column was read in as a generic object type, even though we know that the data consists of dates. To read the column in as a date type, set the parse_dates option to a list containing the name or index of the column, as in the sketch below.
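Here is one way to do this (a sketch that passes the column name rather than its numeric position; parse_dates=[0] would work equally well).

# Parse the 'date' column as a datetime type while reading the file
stocks_df = pd.read_csv('stocks.csv', parse_dates=['date'])
stocks_df.dtypes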
date      datetime64[ns]
open             float64
high             float64
low              float64
close            float64
volume             int64
Name              object
dtype: object
Missing values
It is common for data to contain missing values. It is also common for missing values to be indicated by some sort of indicator, such as a question mark or the word “missing”.
stocks_df = pd.read_csv('stocks_missing_data.csv')
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08    NaN    NaN  14.63  14.75   8407500  AAL
1  2013-02-11  14.89    NaN  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51      ?  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
As you can see, there are two values missing from the first row of data and one from the second row. In addition, the third data row has a question mark in one of the entries. How do we deal with this?
The missing values were read in as "Not a Number" (NaN) by pandas, as we would expect, so nothing further is required from us for these values. However, pandas does not know what to do with the question mark indicating a missing value. To rectify this situation, we can pass a list of values to be treated as missing to the na_values option.
stocks_df = pd.read_csv('stocks_missing_data.csv', na_values=['?'])
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08    NaN    NaN  14.63  14.75   8407500  AAL
1  2013-02-11  14.89    NaN  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51    NaN  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Now the question mark has been read in as a NaN.
Further Reading
The read_csv command is highly flexible, with many, many options. To further explore these options, refer to the documentation.