10. Use Minimal Data Type
Column           Default type  Default size  Minimal type    Minimal size
ip               int64         1.38GB        uint32          0.69GB
app              int64         1.38GB        uint16          0.34GB
device           int64         1.38GB        uint16          0.34GB
os               int64         1.38GB        uint16          0.34GB
channel          int64         1.38GB        uint16          0.34GB
click_time       object        13.1GB        datetime64[ns]  1.38GB
attributed_time  object        5.53GB        datetime64[ns]  1.38GB
is_attributed    int64         1.37GB        uint8           0.17GB
total                          26.9GB                        5GB
df.info(memory_usage='deep')
df.memory_usage(deep=True)
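The idea above can be sketched as follows: pass explicit minimal dtypes to `read_csv` and parse the timestamps on load, then check the footprint with the two calls from the slide. Column names match the competition data; the tiny in-memory CSV is a stand-in so the sketch runs on its own.

```python
import io
import pandas as pd

# Minimal dtypes per column (names from the TalkingData competition; adjust
# to your own data). Unsigned ints suffice because values are never negative.
dtypes = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
}

# A tiny stand-in for train.csv so the sketch is self-contained.
csv = io.StringIO(
    "ip,app,device,os,channel,click_time,is_attributed\n"
    "84774,12,1,13,497,2017-11-07 09:30:38,0\n"
)

# parse_dates turns the timestamp strings into compact datetime64[ns].
df = pd.read_csv(csv, dtype=dtypes, parse_dates=["click_time"])

df.info(memory_usage="deep")           # per-column dtypes plus total footprint
print(df.memory_usage(deep=True))      # exact bytes per column, as a Series
```

With the real 7 GB training file, the same `dtype`/`parse_dates` arguments are what take the table above from 26.9 GB down to 5 GB.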
11. Garbage Collection
● Remove obsolete variables: del var
● Run garbage collection manually immediately after removal
import gc
del var
gc.collect()
Note: “gc” stands for garbage collector; Python’s is a
generational cyclic collector, detecting reference cycles.
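The two bullets combine into a pattern like this: drop the name with `del`, then force a collection so anything caught in a reference cycle is reclaimed right away. The loop and array sizes here are illustrative, not from the talk.

```python
import gc

import numpy as np

# Inside a loop, release each large intermediate as soon as it is obsolete.
for chunk_id in range(3):
    big = np.zeros((1000, 1000))   # stand-in for a large intermediate result
    result = big.sum()             # keep only the small summary we need
    del big                        # drop the reference to the big array
    gc.collect()                   # reclaim objects caught in reference cycles
    print(chunk_id, result)
```

Plain reference counting frees most objects on `del`; the explicit `gc.collect()` matters for objects that reference each other in a cycle.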
14. Intermediate Data Storage
           CSV             hdf5           feather      parquet
read       260 sec         204/27.2 sec   85/131 sec   29.9/162 sec
write      822 sec         59.8 sec       22.8 sec     26.3 sec
data type  lost            ✔ kept         ✔ kept       ✔ kept
row index  not applicable  no constraint  default only default only
15. Better Model or Fast Iteration
Better Model                          Faster Iteration
More computing resources and time     More repetitions
Better results, prospect of prize $$  Easier debugging of algorithm and model
Less likely to overfit                More prone to overfitting
Final stage                           Early stage
Tip: Increase the training data and model complexity
progressively during model building.
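One hedged way to act on the tip: iterate on a random sample early, then grow the fraction as the pipeline stabilises. The fractions and the `fit_model` placeholder are illustrative, not from the talk.

```python
import pandas as pd

# Toy training set standing in for the real data.
df = pd.DataFrame({"x": range(1_000_000), "y": [0, 1] * 500_000})

# Early iterations: small sample, fast debugging; later: more data.
for frac in (0.01, 0.1, 1.0):
    sample = df.sample(frac=frac, random_state=42)
    # fit_model(sample) would go here; for now just report the size.
    print(f"frac={frac}: {len(sample):,} rows")
```

A fixed `random_state` keeps runs comparable while you tweak features, so changes in the score reflect the model, not the sample.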
16. Take-homes
● Memory management
○ Tools
○ Choose minimal data types
○ Garbage collection after removing obsolete variables
● Time management
○ Tools
○ Break down tasks
○ Store binary intermediate results
○ Trade-off between better models and faster iteration
17. Thanks for Your Attention!
● LinkedIn: linkedin.com/in/enfeizhan/
● GitHub: github.com/enfeizhan
● Kaggle: kaggle.com/enfeizhan
● Email: enfeizhan@gmail.com
Thanks, Enrique! And thank you for changing your plan to host tonight. Thank you very much for coming!
Tonight I would like to share with you my lessons after surviving a recent kaggle competition.
A bit about myself: you have probably seen it on the meet-up page. I studied theoretical physics and have been working as a data scientist/analyst for more than three and a half years. I recently started at Servian as a consultant. Servian is a consultancy that helps clients with data analytics, cloud technologies, and so on. For those who are job hunting, there will be a bonus at the end, so please stay on even if you think the content is boring :) You are welcome to get connected through any of these channels.
Before we get into business, viewer discretion is advised if this is relevant to you: in this talk I use the Python pandas library and run calculations in a Jupyter notebook. That’s what I think a Python logo should look like.
Speaking of what big data is, the definition I find most convincing is the three V’s. Today’s lessons focus on volume. For a big cloud cluster, this data size is just crumbs, but even for a pretty good home PC it’s still a challenge. And that’s why I learned these lessons.
First I’ll briefly walk you through the goal of the Kaggle competition. Long story short, this talk is about managing memory and iterating more efficiently.
In this Kaggle competition, you are challenged to predict whether a Chinese mobile user will install an advertised mobile app after clicking on an ad. Here is a, probably not the best, example I found after browsing my phone for an hour. You can see an ad for a game app on a news page; clicking the link leads to the app’s download page. What you just did is recorded: your IP, the app, your device (in my case an iPhone 6), the operating system (in my case iOS 11.4), the channel (this news site), and the click time. If you did download, your download time is recorded and is_attributed is set to 1. So the data fall into the category of time series. For training, you are provided with three to four days of data and predict is_attributed for the following day.
We got about 7 GB of training data and an 800 MB test set. Below are the resources on my PC: about 78 GB in total including swap. You might say that’s not too bad, but trust me, this is like playing Russian Tetris. Though you set out with seemingly a lot of room, if you aren’t cautious about memory usage you will soon be using swap, which is terribly slow, and even worse, the total memory can drain away completely in no time.
First you have to watch in real time how memory-hungry your code is, to figure out which steps deplete your RAM and which should be improved so the follow-up steps can run. For this I use the Linux utility free, which can print memory usage in real time, given the correct flags and parameters, which is a bit tricky. I’ll show you in the next slide.
You can set the time interval at which you want free to print memory usage. Unfortunately you need a bit more to make that happen. I’ll play this video: see, you can set a big number, which is the total number of samples, and these two flags have to be in the correct order.
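The flags described above can be sketched like this; exact behaviour varies between procps versions (on some, `-s` alone keeps repeating, on others you must also pass a count with `-c`, and the order shown is the one that worked in the demo).

```shell
# Print memory usage in human-readable units every 1 second, 3 samples total.
# -h: human-readable, -s: interval in seconds, -c: number of repetitions.
free -h -s 1 -c 3
```

Watching this output alongside a heavy pandas step shows exactly which line of code starts eating into swap.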
Next, you have to minimise your memory use as much as possible. The first thing you should do when loading the dataset is choose the smallest data type for your data. Not surprisingly, the default data types are the most memory-consuming ones. As you can see in this table, 64-bit is the default for integers, and the timestamps are treated as strings, which also claim more memory than necessary. Even at a glance you can see the memory cost is dramatically reduced by choosing data types that are just enough for the data. For example, no negative values occur, so we are better off using unsigned integer types, and we should parse the timestamps to save a lot of memory.
A side note: to get this info you can use these two calls if you use pandas for analysis.
While choosing the best data types saves a lot of RAM at the outset, RAM is still frequently at risk at run time in a situation like ours. Again, Russian Tetris hints at how to solve this problem, as in both scenarios you are challenged with limited real estate. Just as success in Tetris depends on how efficiently blocks are removed, we remove variables once they become obsolete to release the memory they occupy. Generally this is enough: Python automatically releases the memory after you delete the variable. However, sometimes you do need to manually claim the memory back by calling the collect function from the gc module. To be honest, I didn’t know this before the competition. By the way, don’t get confused: gc stands for garbage collector, and Python’s is a generational cyclic collector, built to detect reference cycles, where two variables reference each other. That’s the core reason memory sometimes doesn’t get released: plain reference counting is defeated by the cyclic references, so you have to call the gc module to reclaim the memory. In practice, I even went as far as removing obsolete variables and running garbage collection in each step of a for loop, which helped phenomenally.
All of the above was about managing memory. Now let’s discuss how to save time in building models. First up, you have to know how much time your calculations cost. For this purpose, Jupyter has the magic command %%time, which times all the calculations in a cell. Let’s see the demo in the video.
With the magic command at our disposal, I discovered pitfalls we wouldn’t have imagined. Like this one: I want a high-level description of the data. Presumably the time taken is roughly the sum of the time taken for each column. However, in this example, describing the two columns separately is faster than describing them together. My guess is it’s something about the memory available to carry out the calculation, or paging. Key takeaway: think twice before running a big chunk of calculation; split the calculation up and monitor the memory taken.
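Outside a notebook, the same comparison can be sketched with `time.perf_counter` in place of `%%time` (the data and the two-column split here are illustrative, not the competition data; actual timings depend heavily on available memory).

```python
import time

import numpy as np
import pandas as pd

# Toy two-column frame standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(1_000_000, 2)),
                  columns=["app", "channel"])

def timed(label, fn):
    """Run fn once and report the wall-clock time, like %%time does."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return elapsed

# Each column separately, then both together; on a memory-constrained
# machine the combined call can be slower than the sum of the parts.
t_app = timed("app only", lambda: df[["app"]].describe())
t_channel = timed("channel only", lambda: df[["channel"]].describe())
t_both = timed("both together", lambda: df.describe())
```

Printing the three numbers side by side is what surfaced the pitfall in the first place.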
As many surveys among data scientists have found, most of our time is spent on feature engineering and data cleaning, which produce lots of intermediate data. Very likely you need to store them for later use. You can store them in CSV files; that’s OK, I’m not against it, but you deserve something better. Storing in CSV loses data types like datetime, and you have to parse them again next time you reload, exactly as McKinney, the author of pandas, said in one of his tweets. In addition to losing data types, reading and writing CSV files is much slower than the other formats listed in the table. For comparison, I read and wrote the same file in each format; these numbers are only meant to give a general idea of how different they are. As we can see, writing with the three binary formats is much faster. Reading is interesting: while hdf5 takes more time to read the first time in a session, the other two take less. Personally I like hdf5. Though its numbers don’t look the best, the other two formats have an annoying constraint: the row index of a dataframe has to run from 0 to N, while with hdf5 it can be anything. That comes in handy when you want to carry information in the index.
Lastly, I want to touch on the trade-off between a better model and faster iteration. This may not seem important for a small dataset, but it has a big impact on your work when the data gets bigger. The key reason is that repeatedly trying different features, algorithms, and models is inevitable, and taking too much time on each iteration means missing the deadline of a competition or project. In most cases, using less data to train a less performant model and gaining insights from there is beneficial at the start of a project.
That’s all I want to say. I hope you can take home some hands-on experience with managing memory and time more efficiently.
Thank you for your attention! And again, feel free to contact me through any of these channels.
One last thing: Servian is hiring! Talk to me if you are looking.