Speed up pandas' slow read_excel()
Pandas’ read_excel is way too slow
pandas.read_excel() is slow — even on small datasets (under 50k rows) it can take minutes. The fix: convert the xlsx files to csv first, then use pandas.read_csv().
Getting started
Grab xlsx2csv:
```shell
git clone https://github.com/dilshod/xlsx2csv.git
```

Batch-convert xlsx to csv
Put your script in the same folder as xlsx2csv.py (or use absolute paths).
```python
import glob
import re
import subprocess

import pandas as pd
```

glob lists the xlsx files, re extracts the filename, subprocess invokes xlsx2csv.py, and pandas reads the resulting csvs later on.
```python
xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')
```

Run python xlsx2csv.py with no args to see the manual:
```shell
$ python xlsx2csv.py
usage: xlsx2csv.py [-h] [-v] [-a] [-c OUTPUTENCODING] [-d DELIMITER]
                   [--hyperlinks] [-e]
                   [-E EXCLUDE_SHEET_PATTERN [EXCLUDE_SHEET_PATTERN ...]]
                   [-f DATEFORMAT] [-t TIMEFORMAT] [--floatformat FLOATFORMAT]
                   [--sci-float]
                   [-I INCLUDE_SHEET_PATTERN [INCLUDE_SHEET_PATTERN ...]]
                   [--ignore-formats IGNORE_FORMATS [IGNORE_FORMATS ...]]
                   [-l LINETERMINATOR] [-m] [-n SHEETNAME] [-i]
                   [--skipemptycolumns] [-p SHEETDELIMITER] [-q QUOTING]
                   [-s SHEETID]
                   xlsxfile [outfile]
```

So converting one file is `python xlsx2csv.py [xlsx_filename] [csv_filename]`. Loop that with subprocess.call:
```python
for xlsx in list_of_xlsx:
    filename = re.search(r'(.+[\\/])(.+)(\.xlsx)', xlsx).group(2)
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # on Windows, pass shell=True
    except OSError:
        print('Failed with {}'.format(xlsx))
```

Then concatenate all csvs into one:
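The regex above exists only to pull the bare filename out of the path. If you would rather not maintain a regex, the standard library does the same job; a minimal stdlib-only sketch (the helper name `stem` is mine, not from the post):

```python
import os

def stem(path):
    """Return the filename without directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]

print(stem('./data/Extract/report.xlsx'))  # -> report
```

On Python 3.4+, `pathlib.Path(path).stem` gives the same result.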
```python
outputcsv = './data/bigcsv.csv'

listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # expect 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))
```
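The 24-column check loads each csv fully just to inspect its shape. A cheaper pre-check can look at only the first row with the stdlib csv module; a sketch assuming the first row reflects the column count (`column_count` is a hypothetical helper, not from the post):

```python
import csv
import io

def column_count(csv_text):
    """Count the fields in the first row of csv text."""
    reader = csv.reader(io.StringIO(csv_text))
    return len(next(reader))

print(column_count("a,b,c\n1,2,3\n"))  # -> 3
```

In practice you would open the file and pass the handle to `csv.reader` instead of a string.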
```python
bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
```

The xlsx → csv conversion is now the bottleneck. Parallelize it.
Speeding it up with threads
Because this is I/O-bound (each call mostly waits on a child process), multithreading helps despite the GIL: a thread releases the lock while it blocks on the subprocess.
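To see why threads pay off here, a self-contained sketch that stands in a sleep for the subprocess wait (the numbers below are illustrative, not from the post):

```python
import time
from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

def fake_io(task_id):
    time.sleep(0.2)  # the thread blocks here and releases the GIL
    return task_id

start = time.perf_counter()
with Pool(4) as pool:
    results = pool.map(fake_io, range(8))
elapsed = time.perf_counter() - start
# 8 tasks on 4 threads finish in roughly 2 rounds (~0.4 s) instead of ~1.6 s serially
print(results, round(elapsed, 2))
```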
Build the command list:
```python
commands = []
for filepath in glob.glob(xlsx_path + '*.xlsx'):
    filename = re.search(r'(.+[\\/])(.+)(\.(csv|xlsx|xls))', filepath)
    call = ["python", "./xlsx2csv.py", filepath,
            csv_path + '{}.csv'.format(filename.group(2))]
    commands.append(call)
```

Then fan out with multiprocessing.dummy:
```python
from multiprocessing.dummy import Pool

pool = Pool(2)  # number of concurrent threads

# on Windows, wrap the call: functools.partial(subprocess.call, shell=True)
for i, return_code in enumerate(pool.imap(subprocess.call, commands)):
    if return_code != 0:
        print("Command # {} failed with return code {}.".format(i, return_code))
```

%%timeit on my laptop:
- 1 thread: 27.1 s ± 2.76 s per loop
- 2 threads: 13 s ± 3.91 s per loop
Scale the thread count to your machine.
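One reasonable starting point, assuming the work stays I/O-bound, is to derive the pool size from the core count and cap it (this heuristic is a suggestion of mine, not from the post):

```python
import os

# I/O-bound work tolerates more threads than cores, but each command here
# also spawns a Python interpreter, so cap the pool size.
n_threads = min(8, (os.cpu_count() or 1) * 2)
print(n_threads)
```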