June 21, 2020
(This article originally appeared, in a slightly different form, in Towards Data Science, Medium.)
In any data analysis project, we seek to discover useful, actionable insights from a given set of data. If we're lucky, this data may already be packaged for us; if not, we may need to collect it ourselves. Either way, once the data is stored, we will need to read it into a program to perform analysis.
The software package pandas is often used for this purpose. It is a powerful library that works with the Python programming language. Within pandas, the tool of choice to read in data files is the ubiquitous read_csv function.
In this article, we explore the basics of pandas' read_csv command: header options, specifying a sub-directory where applicable, using delimiters other than commas, identifying which column to use as the index, defining the types of fields, and handling missing values.
The Data
As with any pandas project, we first import pandas using the standard alias pd. Let's also import the numerical package NumPy using its standard alias np.
import pandas as pd
import numpy as np
For our example, let's read in stock price data. This data set lists the date of the trade, its open, close, high and low prices, the volume, and its ticker symbol.
What is a .csv file?
If we look at the data file in a text editor, the first few rows are as follows.
date,open,high,low,close,volume,Name
2013-02-08,15.07,15.12,14.63,14.75,8407500,AAL
2013-02-11,14.89,15.01,14.26,14.46,8882000,AAL
2013-02-12,14.45,14.51,14.1,14.27,8126000,AAL
2013-02-13,14.3,14.94,14.25,14.66,10259500,AAL
2013-02-14,14.94,14.96,13.16,13.99,31879900,AAL
We see that the first row consists of the column names, also known as the header.
Header options
The pandas command to read the data from a file and store it in a data frame called stocks_df is a simple one-liner.
stocks_df = pd.read_csv('stocks.csv')
To make sure this command worked as expected, let’s examine the first few rows of the data frame using head.
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
You might not be interested in all the columns in the .csv file. In this case, specify which columns you want to read into the data frame by using the usecols option. For instance, if you're only interested in the date, the volume, and the name of the stock, specify usecols=['date', 'volume', 'Name'].
stocks_df = pd.read_csv('stocks.csv', usecols=['date', 'volume', 'Name'])
stocks_df.head()
         date    volume Name
0  2013-02-08   8407500  AAL
1  2013-02-11   8882000  AAL
2  2013-02-12   8126000  AAL
3  2013-02-13  10259500  AAL
4  2013-02-14  31879900  AAL
As expected, only the columns we specified are included in the data frame.
By default, pandas assumes that the first row of your data is a header. Of course, not every data file has a header. For instance, suppose we have a data file without a header, as shown below.
2013-02-08,15.07,15.12,14.63,14.75,8407500,AAL
2013-02-11,14.89,15.01,14.26,14.46,8882000,AAL
2013-02-12,14.45,14.51,14.1,14.27,8126000,AAL
2013-02-13,14.3,14.94,14.25,14.66,10259500,AAL
2013-02-14,14.94,14.96,13.16,13.99,31879900,AAL
If we try to read the file in as before, we will not get the results we want.
stocks_df = pd.read_csv('stocks_no_header.csv')
stocks_df.head()
   2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
0  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
1  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
2  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
3  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
4  2013-02-15  13.93  14.61  13.93  14.50  15628000  AAL
pandas assumes the first line is a header, and reads it in as such. In this case, this isn't what we want, since that first row is data. Fortunately, the fix is easy: just specify the option header=None.
stocks_df = pd.read_csv('stocks_no_header.csv', header=None)
stocks_df.head()
            0      1      2      3      4         5    6
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Now pandas has numbered the columns starting at 0. If you prefer to have named columns instead, use the names option to specify your own column names.
col_names = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Ticker Symbol']
stocks_df = pd.read_csv('stocks_no_header.csv', header=None, names=col_names)
stocks_df.head()
         Date   Open   High    Low  Close    Volume Ticker Symbol
0  2013-02-08  15.07  15.12  14.63  14.75   8407500           AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000           AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000           AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500           AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900           AAL
Directories
In the example above, I saved the data file as stocks.csv in the same directory as my Python program. To keep things organized, it is quite common to save data files in a sub-directory. In this case, specify the sub-directory in the read_csv statement.
df = pd.read_csv('data/stocks.csv')
As before, let's check the first few rows to make sure all went well.
df.head()
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Delimiters
Not all data files separate values using commas. Tabs, spaces, or any other character may be used. For instance, our data file might have been set up with semicolons instead of commas.
date;open;high;low;close;volume;Name
2013-02-08;15.07;15.12;14.63;14.75;8407500;AAL
2013-02-11;14.89;15.01;14.26;14.46;8882000;AAL
2013-02-12;14.45;14.51;14.1;14.27;8126000;AAL
2013-02-13;14.3;14.94;14.25;14.66;10259500;AAL
2013-02-14;14.94;14.96;13.16;13.99;31879900;AAL
To specify a separator other than a comma, use the sep option.
stocks_df = pd.read_csv('stocks_semicolon.csv', sep=';')
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Specifying a column to be the index
If you don’t want the default index, you may specify a column to be used as the index of the data frame. For instance, to specify the index to be the date column, use index_col='date'.
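The call for this example might look like the following (a minimal sketch, assuming the same stocks.csv file used above).

# Use the 'date' column as the data frame's index
stocks_df = pd.read_csv('stocks.csv', index_col='date')
stocks_df.head()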
             open   high    low  close    volume Name
date
2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Now the date column is the index, as desired.
Types
Let's check the type of each column.
stocks_df = pd.read_csv('stocks.csv')
stocks_df.dtypes
date       object
open      float64
high      float64
low       float64
close     float64
volume      int64
Name       object
dtype: object
Let's suppose we want to store the numerical data as 32-bit types to save space. To do so, set the dtype option to a dictionary in which the keys are the column names and the values are the desired types.
types = {
    'open': np.float32,
    'high': np.float32,
    'low': np.float32,
    'close': np.float32,
    'volume': np.int32
}
stocks_df = pd.read_csv('stocks.csv', dtype=types)
stocks_df.dtypes
date       object
open      float32
high      float32
low       float32
close     float32
volume      int32
Name       object
dtype: object
The date column was read in as a generic object type, even though we know that the data consists of dates. To read the column in as a date type, set the parse_dates option to a list containing the name or index of the column, as in the sketch below.
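Here is one way to do this (a sketch that passes the column name rather than its numeric position; parse_dates=[0] would work equally well).

# Parse the 'date' column as a datetime type while reading the file
stocks_df = pd.read_csv('stocks.csv', parse_dates=['date'])
stocks_df.dtypes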
date      datetime64[ns]
open             float64
high             float64
low              float64
close            float64
volume             int64
Name              object
dtype: object
Missing values
It is common for data to contain missing values. It is also common for missing values to be indicated by some sort of indicator, such as a question mark or the word “missing”.
stocks_df = pd.read_csv('stocks_missing_data.csv')
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08    NaN    NaN  14.63  14.75   8407500  AAL
1  2013-02-11  14.89    NaN  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51      ?  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
As you can see, there are two values missing from the first row of data and one from the second row. In addition, the third data row has a question mark in one of the entries. How do we deal with this?
The missing values were read in as "Not a Number" (NaN) by pandas, as we would expect, so nothing further is required from us for these values. However, pandas does not know what to do with the question mark indicating a missing value. To rectify this situation, we can pass a list of values to be treated as missing to the na_values option.
stocks_df = pd.read_csv('stocks_missing_data.csv', na_values=['?'])
stocks_df.head()
         date   open   high    low  close    volume Name
0  2013-02-08    NaN    NaN  14.63  14.75   8407500  AAL
1  2013-02-11  14.89    NaN  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51    NaN  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL
Now the question mark has been read in as a NaN.
Further Reading
The read_csv command is highly flexible, with many, many options. To further explore these options, refer to the documentation.