10. Use Minimal Data Type
Column           Default type  Default size  Minimal type    Minimal size
ip               int64         1.38GB        uint32          0.69GB
app              int64         1.38GB        uint16          0.34GB
device           int64         1.38GB        uint16          0.34GB
os               int64         1.38GB        uint16          0.34GB
channel          int64         1.38GB        uint16          0.34GB
click_time       object        13.1GB        datetime64[ns]  1.38GB
attributed_time  object        5.53GB        datetime64[ns]  1.38GB
is_attributed    int64         1.37GB        uint8           0.17GB
total                          26.9GB                        5GB
df.info(memory_usage='deep')
df.memory_usage(deep=True)
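The idea above can be sketched as follows: pass explicit minimal dtypes to `read_csv` and parse the timestamps on load, then check the footprint with the two calls from the slide. Column names match the competition data; the tiny in-memory CSV is a stand-in so the sketch runs on its own.

```python
import io
import pandas as pd

# Minimal dtypes per column (names from the TalkingData competition; adjust
# to your own data). Unsigned ints suffice because values are never negative.
dtypes = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
}

# A tiny stand-in for train.csv so the sketch is self-contained.
csv = io.StringIO(
    "ip,app,device,os,channel,click_time,is_attributed\n"
    "84774,12,1,13,497,2017-11-07 09:30:38,0\n"
)

# parse_dates turns the timestamp strings into compact datetime64[ns].
df = pd.read_csv(csv, dtype=dtypes, parse_dates=["click_time"])

df.info(memory_usage="deep")           # per-column dtypes plus total footprint
print(df.memory_usage(deep=True))      # exact bytes per column, as a Series
```

With the real 7 GB training file, the same `dtype`/`parse_dates` arguments are what take the table above from 26.9 GB down to 5 GB.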
11. Garbage Collection
● Remove obsolete variables: del var
● Run garbage collection manually immediately after removal
import gc
del var
gc.collect()
Note: “gc” stands for garbage collector; Python’s is a
generational cyclic collector, detecting reference cycles.
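The two bullets combine into a pattern like this: drop the name with `del`, then force a collection so anything caught in a reference cycle is reclaimed right away. The loop and array sizes here are illustrative, not from the talk.

```python
import gc

import numpy as np

# Inside a loop, release each large intermediate as soon as it is obsolete.
for chunk_id in range(3):
    big = np.zeros((1000, 1000))   # stand-in for a large intermediate result
    result = big.sum()             # keep only the small summary we need
    del big                        # drop the reference to the big array
    gc.collect()                   # reclaim objects caught in reference cycles
    print(chunk_id, result)
```

Plain reference counting frees most objects on `del`; the explicit `gc.collect()` matters for objects that reference each other in a cycle.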
14. Intermediate Data Storage
           CSV             hdf5           feather      parquet
read       260 sec         204/27.2 sec   85/131 sec   29.9/162 sec
write      822 sec         59.8 sec       22.8 sec     26.3 sec
data type  lost            ✔ kept         ✔ kept       ✔ kept
row index  not applicable  no constraint  default only default only
15. Better Model or Fast Iteration
Better Model                          Faster Iteration
More computing resources and time     More repetitions
Better results, prospect of prize $$  Easier debugging of algorithm and model
Less likely to overfit                More prone to overfitting
Final stage                           Early stage
Tip: Increase the training data and model complexity
progressively during model building.
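One hedged way to act on the tip: iterate on a random sample early, then grow the fraction as the pipeline stabilises. The fractions and the `fit_model` placeholder are illustrative, not from the talk.

```python
import pandas as pd

# Toy training set standing in for the real data.
df = pd.DataFrame({"x": range(1_000_000), "y": [0, 1] * 500_000})

# Early iterations: small sample, fast debugging; later: more data.
for frac in (0.01, 0.1, 1.0):
    sample = df.sample(frac=frac, random_state=42)
    # fit_model(sample) would go here; for now just report the size.
    print(f"frac={frac}: {len(sample):,} rows")
```

A fixed `random_state` keeps runs comparable while you tweak features, so changes in the score reflect the model, not the sample.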
16. Take-homes
● Memory management
○ Tools
○ Choose minimal data types
○ Garbage collection after removing obsolete variables
● Time management
○ Tools
○ Break down tasks
○ Store binary intermediate results
○ Trade-off between better models and faster iteration
17. Thanks for Your Attention!
● LinkedIn: linkedin.com/in/enfeizhan/
● GitHub: github.com/enfeizhan
● Kaggle: kaggle.com/enfeizhan
● Email: enfeizhan@gmail.com
Thanks, Enrique! And thank you for changing your plan to host tonight. Thank you very much for coming!
Tonight I would like to share with you my lessons after surviving a recent kaggle competition.
A bit about myself: you have probably seen it on the meet-up page. I studied theoretical physics and have been working as a data scientist/analyst for more than three and a half years. I recently started at Servian as a consultant. Servian is a consultancy that helps clients with data analytics, cloud technologies, and so on. For those who are job hunting, there will be a bonus at the end, so please stay on even if you think the content is boring :) You are welcome to get connected through any of these channels.
Before we get into business, viewer discretion is advised if this is relevant to you: in this talk I use the Python pandas library and run calculations in a Jupyter notebook. That’s what I think a Python logo should look like.
Speaking of what big data is, the definition I find most convincing is the three V’s. Today’s lessons focus on volume. For a big cloud cluster, this data size is just crumbs, but even for a pretty good home PC it’s still a challenge. And that’s why I learned these lessons.
First I’ll briefly walk you through the goal of the Kaggle competition. Long story short, this talk is about managing memory and iterating more efficiently.
In this Kaggle competition, you are challenged to predict whether a Chinese mobile user will install an advertised mobile app after clicking on an ad. Here is a, probably not the best, example I found after browsing my phone for an hour. You can see an ad for a game app on a news page; clicking the link leads to the app’s download page. What you just did is recorded: your IP, the app, your device (in my case an iPhone 6), the operating system (in my case iOS 11.4), the channel (this news site), and the click time. If you did download, your download time is recorded and is_attributed is set to 1. So the data fall into the category of time series. For training, you are provided with three to four days of data and predict is_attributed for the following day.
We got about 7 GB of training data and an 800 MB test set. Below are the resources on my PC: about 78 GB in total including swap. You might say that’s not too bad, but trust me, this is like playing Russian Tetris. Though you set out with seemingly a lot of room, if you aren’t cautious about memory usage you will soon be using swap, which is terribly slow, and even worse, the total memory can drain away completely in no time.
First you have to watch in real time how memory-hungry your code is, to figure out which steps deplete your RAM and which should be improved so the follow-up steps can run. For this I use the Linux utility free, which can print memory usage in real time, given the correct flags and parameters, which is a bit tricky. I’ll show you in the next slide.
You can set the time interval at which you want free to print memory usage. Unfortunately you need a bit more to make that happen. I’ll play this video: see, you can set a big number, which is the total number of samples, and these two flags have to be in the correct order.
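The flags described above can be sketched like this; exact behaviour varies between procps versions (on some, `-s` alone keeps repeating, on others you must also pass a count with `-c`, and the order shown is the one that worked in the demo).

```shell
# Print memory usage in human-readable units every 1 second, 3 samples total.
# -h: human-readable, -s: interval in seconds, -c: number of repetitions.
free -h -s 1 -c 3
```

Watching this output alongside a heavy pandas step shows exactly which line of code starts eating into swap.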
Next, you have to minimise your memory use as much as possible. The first thing you should do when loading the dataset is choose the smallest data type for your data. Not surprisingly, the default data types are the most memory-consuming ones. As you can see in this table, 64-bit is the default for integers, and the timestamps are treated as strings, which also claim more memory than necessary. Even at a glance you can see the memory cost is dramatically reduced by choosing data types that are just enough for the data. For example, no negative values occur, so we are better off using unsigned integer types, and we should parse the timestamps to save a lot of memory.
A side note: to get this info you can use these two calls if you use pandas for analysis.
While choosing the best data types saves a lot of RAM at the outset, RAM is still frequently at risk at run time in a situation like ours. Again, Russian Tetris hints at how to solve this problem, as in both scenarios you are challenged with limited real estate. Just as success in Tetris depends on how efficiently blocks are removed, we remove variables once they become obsolete to release the memory they occupy. Generally this is enough: Python automatically releases the memory after you delete the variable. However, sometimes you do need to manually claim the memory back by calling the collect function from the gc module. To be honest, I didn’t know this before the competition. By the way, don’t get confused: gc stands for garbage collector, and Python’s is a generational cyclic collector, built to detect reference cycles, where two variables reference each other. That’s the core reason memory sometimes doesn’t get released: plain reference counting is defeated by the cyclic references, so you have to call the gc module to reclaim the memory. In practice, I even went as far as removing obsolete variables and running garbage collection in each step of a for loop, which helped phenomenally.
All of the above was about managing memory. Now let’s discuss how to save time in building models. First up, you have to know how much time your calculations cost. For this purpose, Jupyter has the magic command %%time, which times all the calculations in a cell. Let’s see the demo in the video.
With the magic command at our disposal, I discovered pitfalls we wouldn’t have imagined. Like this one: I want a high-level description of the data. Presumably the time taken is roughly the sum of the time taken for each column. However, in this example, describing the two columns separately is faster than describing them together. My guess is it’s something about the memory available to carry out the calculation, or paging. Key takeaway: think twice before running a big chunk of calculation; split the calculation up and monitor the memory taken.
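Outside a notebook, the same comparison can be sketched with `time.perf_counter` in place of `%%time` (the data and the two-column split here are illustrative, not the competition data; actual timings depend heavily on available memory).

```python
import time

import numpy as np
import pandas as pd

# Toy two-column frame standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(1_000_000, 2)),
                  columns=["app", "channel"])

def timed(label, fn):
    """Run fn once and report the wall-clock time, like %%time does."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return elapsed

# Each column separately, then both together; on a memory-constrained
# machine the combined call can be slower than the sum of the parts.
t_app = timed("app only", lambda: df[["app"]].describe())
t_channel = timed("channel only", lambda: df[["channel"]].describe())
t_both = timed("both together", lambda: df.describe())
```

Printing the three numbers side by side is what surfaced the pitfall in the first place.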
As many surveys among data scientists have found, most of our time is spent on feature engineering and data cleaning, which produce lots of intermediate data. Very likely you need to store them for later use. You can store them in CSV files; that’s OK, I’m not against it, but you deserve something better. Storing in CSV loses data types like datetime, and you have to parse them again next time you reload, exactly as McKinney, the author of pandas, said in one of his tweets. In addition to losing data types, reading and writing CSV files is much slower than the other formats listed in the table. For comparison, I read and wrote the same file in each format; these numbers are only meant to give a general idea of how different they are. As we can see, writing with the three binary formats is much faster. Reading is interesting: while hdf5 takes more time to read the first time in a session, the other two take less. Personally I like hdf5. Though its numbers don’t look the best, the other two formats have an annoying constraint: the row index of a dataframe has to run from 0 to N, while with hdf5 it can be anything. That comes in handy when you want to carry information in the index.
Lastly, I want to touch on the trade-off between a better model and faster iteration. This may not seem important for a small dataset, but it has a big impact on your work when the data gets bigger. The key reason is that repeatedly trying different features, algorithms, and models is inevitable, and taking too much time on each iteration means missing the deadline of a competition or project. In most cases, using less data to train a less performant model and gaining insights from there is beneficial at the start of a project.
That’s all I want to say. I hope you can take home some hands-on experience with managing memory and time more efficiently.
Thank you for your attention! And again, feel free to contact me through any of these channels.
One last thing: Servian is hiring! Talk to me if you are looking.