In this recipe we’ll look into the topic of loading text files in Pandas dataframes. for further data wrangling for visualization purposes or as a preparatory step for Machine Learning.
Use pd.read_csv() to load text file with tab delimiters
Let’s outline this using a simple example. Suppose we have a text file that has several rows. The values are separated by a tab delimiter. If you want to follow along, you can copy and paste the text into a .txt file:
Person sales
Liam 120
Harry 180
Louis 100
Niall 150
Let’s now go ahead and load it to Pandas. In this example i suumed the file is saved in a folder named data (will work in Windows, Linux or macOS).
import pandas as pd
sales= pd.read_csv('../data/sales_tab.txt')
sales.head()
As you can see below, Pandas didn’t read correctly the tabstops in your file.
| Person\tsales | |
|---|---|
| 0 | Liam\t120 |
| 1 | Harry\t180 |
| 2 | Louis\t100 |
| 3 | Niall\t150 |
The solution is to swith the delimiter=’\t’ parameter of the pd.read_csv() function to define the tabspace as the delimiting character.
sales= pd.read_csv('../data/sales_tab.txt', delimiter='\t')
sales.head()
This is pretty cool. with a couple of code rows you got the text file line by line (including the headers) into your Pandas sales dataframe and now you can start to work with the data.
| Person | sales | |
|---|---|---|
| 0 | Liam | 120 |
| 1 | Harry | 180 |
| 2 | Louis | 100 |
| 3 | Niall | 150 |
Note that you can obviously customize the column headers as needed, by using the names parameter, as shown below.
sales= pd.read_csv('../data/sales_tab.txt', delimiter='\t', names=['singer', 'net worth'])
Read text with no spaces into Pandas
What of the content of the text file is just separated by commas, like a .csv file and the file is encoded as utf-16, as it was probably saved from Excel?
This won’t work:
sales2= pd.read_csv('../data/sales_txt.txt')
You’ll get an encoding error: ‘utf-8’ codec can’t decode byte 0xff in position 0: invalid start byte
Solution is to use the encoding parameter and set it to utf-16 or other relevant encoding as needed:
sales2= pd.read_csv('../data/sales_txt.txt', \
encoding='utf-16')