In this recipe we’ll look into the topic of loading text files in Pandas dataframes. for further data wrangling for visualization purposes or as a preparatory step for Machine Learning.
Use pd.read_csv() to load text file with tab delimiters
Let’s outline this using a simple example. Suppose we have a text file that has several rows. The values are separated by a tab delimiter. If you want to follow along, you can copy and paste the text into a .txt file:
Person sales
Liam 120
Harry 180
Louis 100
Niall 150
Let’s now go ahead and load it to Pandas. In this example i suumed the file is saved in a folder named data (will work in Windows, Linux or macOS).
import pandas as pd
sales= pd.read_csv('../data/sales_tab.txt')
sales.head()
As you can see below, Pandas didn’t read correctly the tabstops in your file.
Person\tsales | |
---|---|
0 | Liam\t120 |
1 | Harry\t180 |
2 | Louis\t100 |
3 | Niall\t150 |
The solution is to swith the delimiter=’\t’ parameter of the pd.read_csv() function to define the tabspace as the delimiting character.
sales= pd.read_csv('../data/sales_tab.txt', delimiter='\t')
sales.head()
This is pretty cool. with a couple of code rows you got the text file line by line (including the headers) into your Pandas sales dataframe and now you can start to work with the data.
Person | sales | |
---|---|---|
0 | Liam | 120 |
1 | Harry | 180 |
2 | Louis | 100 |
3 | Niall | 150 |
Note that you can obviously customize the column headers as needed, by using the names parameter, as shown below.
sales= pd.read_csv('../data/sales_tab.txt', delimiter='\t', names=['singer', 'net worth'])
Read text with no spaces into Pandas
What of the content of the text file is just separated by commas, like a .csv file and the file is encoded as utf-16, as it was probably saved from Excel?
This won’t work:
sales2= pd.read_csv('../data/sales_txt.txt')
You’ll get an encoding error: ‘utf-8’ codec can’t decode byte 0xff in position 0: invalid start byte
Solution is to use the encoding parameter and set it to utf-16 or other relevant encoding as needed:
sales2= pd.read_csv('../data/sales_txt.txt', \
encoding='utf-16')