Speed and time are key factors for any Data Scientist. In business, you do not usually work with toy datasets of a few thousand samples. It is more likely that your datasets will contain millions or hundreds of millions of samples. Customer orders, web logs, billing events, stock prices: datasets are now huge.
I assume you do not want to spend hours or days waiting for your data processing to complete. The biggest dataset I have worked with so far contained over 30 million records. When I ran my data processing script on this dataset for the first time, the estimated time to complete was around 4 days! I do not have a very powerful machine (a MacBook Air with an i5 and 4 GB of RAM), but the most I could accept was running the script overnight, not for multiple days.
Thanks to some clever tricks, I was able to decrease the running time to a few hours. This post explains the first step towards good data processing performance: choosing the right library or framework for your dataset.
The graph below shows the results of my experiment (details below), expressed as the processing speed of each tool relative to the processing speed of pure Python.
As you can see, Numpy is several times faster than Pandas. I personally love Pandas for simplifying many tedious data science tasks, and I use it wherever I can. But if the expected processing time runs to many hours, then, with regret, I switch from Pandas to Numpy.
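To make the "speed relative to pure Python" measure concrete, here is a minimal sketch of how such a comparison could be set up. It is not the experiment behind the graph: the task (computing a mean), the sample count `N`, and the helper names `mean_python`, `mean_pandas`, `mean_numpy` and `compare` are all assumptions chosen for illustration, and the numbers you get will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000  # hypothetical sample count, far smaller than a real dataset

values = [float(i) for i in range(N)]
array = np.array(values)      # same data as a Numpy array
series = pd.Series(values)    # same data as a Pandas Series


def mean_python(data):
    """Pure Python baseline: loop over the list element by element."""
    total = 0.0
    for x in data:
        total += x
    return total / len(data)


def mean_pandas(s):
    """Pandas: vectorized mean of a Series."""
    return s.mean()


def mean_numpy(a):
    """Numpy: vectorized mean of an array."""
    return a.mean()


def best_time(func, arg, repeats=5):
    """Return the fastest of several single runs, in seconds."""
    return min(timeit.repeat(lambda: func(arg), number=1, repeat=repeats))


def compare():
    """Return each tool's speed relative to pure Python (baseline = 1.0)."""
    baseline = best_time(mean_python, values)
    return {
        "python": 1.0,
        "pandas": baseline / best_time(mean_pandas, series),
        "numpy": baseline / best_time(mean_numpy, array),
    }


if __name__ == "__main__":
    for name, speed in compare().items():
        print(f"{name}: {speed:.1f}x the speed of pure Python")
```

A value above 1.0 means the tool finished the same task faster than the pure Python loop. Taking the minimum of several runs is one common way to reduce noise from other processes; a different task or data size may produce a very different ranking.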
I am very aware that actual performance may vary significantly, depending on the task and the type of processing. So please treat these results as indicative only. There is no single test that can show an "overall" comparison of performance for any set of software tools.
Posted on July 15, 2017 by