Speed up pandas' slow read_excel()
Pandas’ read_excel is way too slow
pandas.read_excel() is slow — even on small datasets (under 50k rows) it can take minutes. The fix: convert the xlsx files to csv first, then use pandas.read_csv().
Getting started
Grab xlsx2csv:
```shell
git clone https://github.com/dilshod/xlsx2csv.git
```

Batch-convert xlsx to csv
Put your script in the same folder as xlsx2csv.py (or use absolute paths).
```python
import glob
import re
import subprocess

import pandas as pd
```

glob lists the xlsx files, re extracts the filename, subprocess invokes xlsx2csv.py, and pandas reads the resulting csvs later on.
```python
xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')
```

Run python xlsx2csv.py with no args to see the manual:
```shell
$ python xlsx2csv.py
usage: xlsx2csv.py [-h] [-v] [-a] [-c OUTPUTENCODING] [-d DELIMITER]
                   [--hyperlinks] [-e]
                   [-E EXCLUDE_SHEET_PATTERN [EXCLUDE_SHEET_PATTERN ...]]
                   [-f DATEFORMAT] [-t TIMEFORMAT] [--floatformat FLOATFORMAT]
                   [--sci-float]
                   [-I INCLUDE_SHEET_PATTERN [INCLUDE_SHEET_PATTERN ...]]
                   [--ignore-formats IGNORE_FORMATS [IGNORE_FORMATS ...]]
                   [-l LINETERMINATOR] [-m] [-n SHEETNAME] [-i]
                   [--skipemptycolumns] [-p SHEETDELIMITER] [-q QUOTING]
                   [-s SHEETID]
                   xlsxfile [outfile]
```

So converting one file is `python xlsx2csv.py [xlsx_filename] [csv_filename]`. Loop that with subprocess.call:
```python
for xlsx in list_of_xlsx:
    filename = re.search(r'(.+[\\/])(.+)(\.xlsx)', xlsx).group(2)
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # on Windows, pass shell=True
    except OSError:
        print('Failed with {}'.format(xlsx))
```

Then concatenate all csvs into one:
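The regex above exists only to pull the bare filename out of the path. If you would rather not maintain a regex, the standard library does the same job; a minimal stdlib-only sketch (the helper name `stem` is mine, not from the post):

```python
import os

def stem(path):
    """Return the filename without directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]

print(stem('./data/Extract/report.xlsx'))  # -> report
```

On Python 3.4+, `pathlib.Path(path).stem` gives the same result.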
```python
outputcsv = './data/bigcsv.csv'

listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # expect 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))
```
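The 24-column check loads each csv fully just to inspect its shape. A cheaper pre-check can look at only the first row with the stdlib csv module; a sketch assuming the first row reflects the column count (`column_count` is a hypothetical helper, not from the post):

```python
import csv
import io

def column_count(csv_text):
    """Count the fields in the first row of csv text."""
    reader = csv.reader(io.StringIO(csv_text))
    return len(next(reader))

print(column_count("a,b,c\n1,2,3\n"))  # -> 3
```

In practice you would open the file and pass the handle to `csv.reader` instead of a string.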
```python
bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
```

The xlsx → csv conversion is now the bottleneck. Parallelize it.
Speeding it up with threads
Because this is I/O-bound (each call mostly waits on a child process), multithreading helps despite the GIL: a thread releases the lock while it blocks on the subprocess.
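To see why threads pay off here, a self-contained sketch that stands in a sleep for the subprocess wait (the numbers below are illustrative, not from the post):

```python
import time
from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

def fake_io(task_id):
    time.sleep(0.2)  # the thread blocks here and releases the GIL
    return task_id

start = time.perf_counter()
with Pool(4) as pool:
    results = pool.map(fake_io, range(8))
elapsed = time.perf_counter() - start
# 8 tasks on 4 threads finish in roughly 2 rounds (~0.4 s) instead of ~1.6 s serially
print(results, round(elapsed, 2))
```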
Build the command list:
```python
commands = []
for filepath in glob.glob(xlsx_path + '*.xlsx'):
    filename = re.search(r'(.+[\\/])(.+)(\.(csv|xlsx|xls))', filepath)
    call = ["python", "./xlsx2csv.py", filepath,
            csv_path + '{}.csv'.format(filename.group(2))]
    commands.append(call)
```

Then fan out with multiprocessing.dummy:
```python
from multiprocessing.dummy import Pool

pool = Pool(2)  # number of concurrent threads

# on Windows, wrap the call: functools.partial(subprocess.call, shell=True)
for i, return_code in enumerate(pool.imap(subprocess.call, commands)):
    if return_code != 0:
        print("Command # {} failed with return code {}.".format(i, return_code))
```

%%timeit on my laptop:
- 1 thread: 27.1 s ± 2.76 s per loop
- 2 threads: 13 s ± 3.91 s per loop
Scale the thread count to your machine.
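One reasonable starting point, assuming the work stays I/O-bound, is to derive the pool size from the core count and cap it (this heuristic is a suggestion of mine, not from the post):

```python
import os

# I/O-bound work tolerates more threads than cores, but each command here
# also spawns a Python interpreter, so cap the pool size.
n_threads = min(8, (os.cpu_count() or 1) * 2)
print(n_threads)
```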