In this recipe we’ll look into the topic of loading text files in Pandas dataframes. for further data wrangling for visualization purposes or as a preparatory step for Machine Learning.
Use pd.read_csv() to load text file with tab delimiters
Let’s outline this using a simple example. Suppose we have a text file that has several rows. The values are separated by a tab delimiter. If you want to follow along, you can copy and paste the text into a .txt file:
Person sales
Liam 120
Harry 180
Louis 100
Niall 150
Let’s now go ahead and load it to Pandas. In this example i suumed the file is saved in a folder named data (will work in Windows, Linux or macOS).
import pandas as pd
sales= pd.read_csv('../data/sales_tab.txt')
sales.head()
As you can see below, Pandas didn’t read correctly the tabstops in your file.
| Person\tsales | |
|---|---|
| 0 | Liam\t120 |
| 1 | Harry\t180 |
| 2 | Louis\t100 |
| 3 | Niall\t150 |
The solution is to swith the delimiter=’\t’ parameter of the pd.read_csv() function to define the tabspace as the delimiting character.
sales= pd.read_csv('../data/sales_tab.txt', delimiter='\t')
sales.head()
This is pretty cool. with a couple of code rows you got the text file line by line (including the headers) into your Pandas sales dataframe and now you can start to work with the data.
| Person | sales | |
|---|---|---|
| 0 | Liam | 120 |
| 1 | Harry | 180 |
| 2 | Louis | 100 |
| 3 | Niall | 150 |
Note that you can obviously customize the column headers as needed, by using the names parameter, as shown below.
sales= pd.read_csv('../data/sales_tab.txt', delimiter='\t', names=['singer', 'net worth'])
Read text with no spaces into Pandas
What of the content of the text file is just separated by commas, like a .csv file and the file is encoded as utf-16, as it was probably saved from Excel?
This won’t work:
sales2= pd.read_csv('../data/sales_txt.txt')
You’ll get an encoding error: ‘utf-8’ codec can’t decode byte 0xff in position 0: invalid start byte
Solution is to use the encoding parameter and set it to utf-16 or other relevant encoding as needed:
sales2= pd.read_csv('../data/sales_txt.txt', \
encoding='utf-16')
Troubleshooting Common Issues
Why does pd.read_csv() misalign columns when reading a text file?
Column misalignment usually happens when the delimiter is not set correctly. Tab-separated files need sep='\t', and files with irregular whitespace require sep='\s+' with engine='python'. Check the raw file in a text editor to identify the actual separator character. Another common cause is quoted fields containing the delimiter itself — setting quotechar='"' resolves that. After loading the data correctly, you might want to export a dataframe column to a Python list to verify that values parsed into the right fields.
How do I skip header rows or footer lines when reading a text file?
Use skiprows to ignore lines at the top and skipfooter for lines at the bottom. For example, pd.read_csv('data.txt', sep='\t', skiprows=3, skipfooter=2, engine='python') skips the first three and last two lines. You can also pass a list to skiprows, such as skiprows=[0, 2, 5], to skip specific non-consecutive lines. Once your data is loaded cleanly, columns that contain combined values can be separated using techniques for splitting one column into multiple columns to normalize your dataset for analysis.
What encoding should I use when pd.read_csv() throws a UnicodeDecodeError?
Start with encoding='utf-8', which covers most modern files. If that fails, try encoding='latin-1' or encoding='cp1252' for files created on Windows systems. You can detect the encoding programmatically with the chardet library by reading the first few thousand bytes of the file. As a last resort, encoding='utf-8' combined with errors='replace' substitutes unreadable characters with a placeholder rather than crashing the import entirely.