Tips & Tricks to Survive “Big” Data
A Few Lessons from a Kaggle Competition
About Me
● Consultant at: www.servian.com
● LinkedIn: linkedin.com/in/enfeizhan/
● GitHub: github.com/enfeizhan
● Kaggle: kaggle.com/enfeizhan
● Email: enfeizhan@gmail.com
R viewers are advised that the following
slides contain Pythonic languages and
violence that may be disturbing.
Overview
● Intro to Kaggle challenge
● Memory Management
○ Data type choice
○ Keep everything clean
● Speed management
○ Intermediate data storage
○ Faster vs better trade-off
TalkingData Ad Tracking
Size of Data & Compute Resources
Data size: about 7 GB training set, 800 MB test set
Memory of home PC: about 78 GB in total, including swap
WATCH OUT!
Memory Monitoring
Use free to watch memory usage:
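The slide relies on the Linux free utility. As a rough alternative that can run inside a notebook, the sketch below reads the same kernel counters from /proc/meminfo. It assumes a Linux machine; the helper names parse_meminfo and memory_snapshot and the sample text are made up for illustration and are not from the talk.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer} (most fields are in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_snapshot(path="/proc/meminfo"):
    """Return available RAM and free swap in GiB (Linux only)."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    kib_per_gib = 1024 ** 2
    return {
        "ram_available_gib": info.get("MemAvailable", 0) / kib_per_gib,
        "swap_free_gib": info.get("SwapFree", 0) / kib_per_gib,
    }


# Hypothetical sample text so the parser can be tried on any platform.
sample = (
    "MemTotal:       16384000 kB\n"
    "MemAvailable:    8192000 kB\n"
    "SwapFree:        2048000 kB\n"
)
parsed = parse_meminfo(sample)
print(parsed["MemAvailable"])
```

Calling memory_snapshot() before and after a heavy step gives a quick view of how much RAM and swap that step consumed.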
Use Minimal Data Type
Column           Default dtype  Default size  Minimal dtype   Minimal size
ip               int64          1.38 GB       uint32          0.69 GB
app              int64          1.38 GB       uint16          0.34 GB
device           int64          1.38 GB       uint16          0.34 GB
os               int64          1.38 GB       uint16          0.34 GB
channel          int64          1.38 GB       uint16          0.34 GB
click_time       object         13.1 GB       datetime64[ns]  1.38 GB
attributed_time  object         5.53 GB       datetime64[ns]  1.38 GB
is_attributed    int64          1.37 GB       uint8           0.17 GB
total                           26.9 GB                       5 GB
df.info(memory_usage='deep')
df.memory_usage(deep=True)
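A minimal sketch of the idea in the table, using a small synthetic frame rather than the competition data. The column values are invented; only the column names and target dtypes follow the slide.

```python
import numpy as np
import pandas as pd

# Small synthetic frame; the real competition data is not reproduced here.
df = pd.DataFrame({
    "ip": np.arange(1000, dtype="int64"),
    "app": np.arange(1000, dtype="int64") % 500,
    "is_attributed": np.arange(1000, dtype="int64") % 2,
    "click_time": ["2017-11-06 14:32:21"] * 1000,
})
before = df.memory_usage(deep=True).sum()

# Downcast the integers to the smallest unsigned type that still fits the values.
small = df.astype({"ip": "uint32", "app": "uint16", "is_attributed": "uint8"})
# Parse timestamp strings into a fixed-width datetime column.
small["click_time"] = pd.to_datetime(small["click_time"])
after = small.memory_usage(deep=True).sum()

print(f"{before} bytes -> {after} bytes")
```

When loading from disk, the same choices can be made up front through the dtype and parse_dates parameters of pd.read_csv, so the wide default types are never materialised.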
Garbage Collection
● Remove obsolete variables: del var
● Manual garbage collection immediately after
removal
import gc
del var
gc.collect()
Note: “gc” is Python's garbage collector module; the collector is
generational and detects reference cycles.
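To illustrate why del alone is sometimes not enough, here is a small sketch with two objects that reference each other. The Node class is a made-up example; the automatic collector is switched off only to make the effect visible.

```python
import gc


class Node:
    """Tiny object that can point at another node."""

    def __init__(self):
        self.other = None


gc.disable()   # keep the automatic collector out of the way for the demo
gc.collect()   # start from a clean state

a, b = Node(), Node()
a.other, b.other = b, a   # a and b now reference each other
del a, b                  # the names are gone, but the cycle keeps both alive

freed = gc.collect()      # the cyclic collector finds and frees them
gc.enable()
print(freed)              # number of unreachable objects found
```

Objects that are not part of a cycle are released as soon as their last reference disappears, so the explicit gc.collect() call matters mainly for cyclic structures.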
Time Consumption Monitoring
Use Jupyter Notebook magic built-in command: %%time
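%%time only works inside Jupyter. For plain scripts, a similar measurement can be sketched with the standard library; the timed helper below is a hypothetical name, not something from the talk.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))

print(timings)
```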
Think Twice
Describing two columns separately vs together: 8.88 + 9.81 = 37.4 ?????
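One way to act on this, sketched on synthetic data: describe the columns one at a time, time each step, and stitch the pieces together. The frame and its values are invented for illustration.

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "app": np.arange(10_000) % 300,
    "channel": np.arange(10_000) % 120,
})

pieces = {}
seconds = {}
for column in df.columns:
    start = time.perf_counter()
    pieces[column] = df[column].describe()   # one column at a time
    seconds[column] = time.perf_counter() - start

# Combine the per-column summaries into one table, one column per input column.
summary = pd.concat(pieces, axis=1)
print(summary)
print(seconds)
```

On a frame this small the split makes no practical difference; the point is the pattern of breaking a large call into pieces that can be timed and monitored individually.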
Intermediate Data Storage
            CSV             hdf5           feather     parquet
read        260 sec         204/27.2 sec   85/131 sec  29.9/162 sec
write       822 sec         59.8 sec       22.8 sec    26.3 sec
data type   lost            ✔ kept         ✔ kept      ✔ kept
row index   not applicable  no constraint  default     default
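The "data type" row can be checked with a short round trip. As an assumption to keep the sketch dependency-free, pickle stands in for the binary formats here, because hdf5, feather and parquet need extra packages (PyTables or pyarrow); the file names and sample values are invented.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "app": pd.Series([3, 14, 15], dtype="uint16"),
    "click_time": pd.to_datetime(
        ["2017-11-06 14:32:21", "2017-11-06 14:33:34", "2017-11-06 14:34:12"]
    ),
})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "clicks.csv")
    pkl_path = os.path.join(folder, "clicks.pkl")

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.dtypes)   # dtypes are re-inferred from text
print(from_pkl.dtypes)   # dtypes match the original frame
```

With the optional packages installed, df.to_hdf, df.to_feather and df.to_parquet can be swapped in for to_pickle in the same pattern.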
Better Model or Fast Iteration
Better Model                           Faster Iteration
More computing resources and time      More repetitions
Better results, prospect of prize $$   Debugging algorithm and model
Less likely overfitting                Overfitting
Final stage                            Early stage
Tip: Increase the training data and model complexity
progressively during the model building process.
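The progressive approach might be organised as a simple schedule of growing training sizes. The function name, the starting fraction and the growth factor below are all illustrative assumptions, not values from the talk.

```python
def progressive_sizes(total_rows, start_fraction=0.01, growth=4.0):
    """Yield growing row counts, ending with the full dataset."""
    fraction = start_fraction
    while fraction < 1.0:
        yield max(1, int(total_rows * fraction))
        fraction *= growth
    yield total_rows


for rows in progressive_sizes(1_000_000):
    # Placeholder: train and evaluate on the first `rows` rows here.
    print(rows)
```

Early rounds on small slices are for debugging features and code quickly; the final, slow round on all rows is reserved for the model that will be submitted.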
Take-homes
● Memory management
○ Tools
○ Choose minimal data types
○ Garbage collection after removing obsolete variables
● Time management
○ Tools
○ Break down tasks
○ Store binary intermediate results
○ Trade-off between better models and faster iteration
Thanks for Your Attention!
● LinkedIn: linkedin.com/in/enfeizhan/
● GitHub: github.com/enfeizhan
● Kaggle: kaggle.com/enfeizhan
● Email: enfeizhan@gmail.com
Servian IS HIRING!

Editor's Notes

  1. Thanks, Enrique! And thank you for changing your plans to host tonight. Thank you all very much for coming! Tonight I would like to share the lessons I learned from surviving a recent Kaggle competition.
  2. A bit about myself: you have probably seen this on the meet-up page. I studied theoretical physics and have been working as a data scientist/analyst for more than three and a half years. I recently started at Servian as a consultant. Servian is a consultancy that helps clients with data analytics, cloud technologies and so on. For those who are job hunting, there will be a bonus at the end, so please stay on even if you think the content is boring :) You are welcome to connect with me through any of these.
  3. Before we get down to business, a viewer discretion warning, if it's relevant to you. You will see that in this talk I use the Python pandas library and run calculations in Jupyter Notebook. That's what I think a Python logo should be.
  4. Speaking of what big data is, the most convincing definition is the three V's. For today's lessons, we'll be focusing on volume. For a big cloud cluster, today's data size is just crumbs. But even for a pretty good home PC, it's still a challenge. And that's how I learned these lessons.
  5. First I'll briefly walk you through the goal of the Kaggle competition. Long story short, the rest is about managing memory and iterating more efficiently.
  6. In this Kaggle competition, you are challenged to predict whether a Chinese mobile user will install an advertised mobile app after clicking on an ad. Here is an example, though probably not the best one; I found it after browsing for an hour on my phone. You can see this is an ad for a game app on a news page. Clicking on this link leads to the app download page. What you just did is recorded, including your ip, this app, your device (in my case an iPhone 6), the operating system (in my case iOS 11.4), the channel, which is this news site, and the click time. And if you did download, your download time and a 1 for is_attributed. So the data fall into the category of time series. For training, you are provided with three to four days of data, and you predict is_attributed for the following day.
  7. We got about 7 GB for the training set and 800 MB for the test set. Below are the resources on my PC, about 78 GB in total including the swap. You might say that's not too bad. But trust me, this is similar to playing Tetris. Though you set out with seemingly a lot of room, if you aren't cautious about memory usage, first you end up using your swap, which is much slower, and even worse, the total memory drains away very quickly.
  8. First you have to watch how memory-hungry your code is in real time, to figure out which steps deplete your RAM and which steps should be improved so that the follow-up steps can run. For this I use the Linux utility free, which can print out the memory usage in real time, given the correct flags and parameters, which is a bit tricky. I'll show you in the next slide.
  9. You can set the time interval at which you want free to print out the memory usage. Unfortunately you need a bit more to make that happen. I'll play this video. See, you can set a big number, which is the total number of counts. And these two have to be in the correct order.
  10. Next you've got to minimise your memory use as much as possible. The first thing you should do when you load the dataset is choose the smallest data type that fits your data. Not surprisingly, the default data type is the most memory-consuming one. As you can see in this table, 64 bits is the default for integers. And the timestamps are treated as strings, which also claim more memory than necessary. Even at a glance you can see that the memory cost is dramatically reduced by choosing a data type that is just big enough for the data. For example, there are no negative values, so we are better off using unsigned integer types, and we must parse the timestamps to save a lot of memory. A side note: to get this info you can use these two calls if you use pandas for analysis.
  11. While choosing the best data type saves a lot of RAM at the outset, RAM is still frequently at risk at run time in a situation like ours. Again, Tetris gives you a hint of how to solve this problem, as in both scenarios you are challenged with limited real estate. As is well known, success in Tetris depends on how efficiently blocks are removed. Similarly, we remove variables once they become obsolete to release the memory they occupied. Generally this is enough: Python automatically releases the memory once nothing references the object any more. Sometimes, however, you need to claim the memory back manually by calling the collect function from the gc module. To be honest, I didn't know this before this competition. BTW, gc is Python's garbage collector module; the collector is generational and its job is to detect reference cycles, where objects reference each other. That's the core reason memory doesn't always get released straight away: reference counting alone cannot free objects caught in a cycle, so they stay around until the cyclic collector runs. Calling gc.collect() forces that run. In practice, I even went as far as removing obsolete variables and collecting garbage in each step of a for loop, which helps phenomenally.
  12. All of the above was about managing memory. Now let's discuss how to save time in building models. First up, you have to know how much time your calculation takes. For this purpose Jupyter has the magic command %%time, which times all the calculations in a cell. Let's see the demo in the video.
  13. With the magic command at our disposal, I discovered pitfalls I wouldn't have imagined. Like this one. I want to see a high-level description of the data. Presumably the time taken is roughly the sum of the time taken for each column. However, in this example, describing two columns separately takes less time in total than describing them together. My guess is that it has something to do with the memory available to carry out the calculation, or paging sort of stuff. Key take-away: think twice before running a big chunk of calculation. Split the calculation and monitor the memory taken.
  14. As discovered in many surveys among data scientists, most of our time is spent on feature engineering and data cleaning, which produce lots of intermediate data. Very likely you need to store them for later use. You can store them in CSV files. That's OK. I am not against that. But you deserve something better, because storing in CSV loses data types like datetime, and you have to parse them again the next time you reload the file. Exactly like McKinney said in one of his tweets. BTW he is the author of pandas. In addition to losing data types, reading and writing CSV files is much slower than the other formats listed in the table. For comparison, I read and wrote the same file in each format. These numbers are only supposed to give you a general idea of how different they are. As we can see, writing with these three formats is much faster. Reading is interesting. While hdf5 takes more time to read for the first time in a session, these two take less. Personally I like to use hdf5. It doesn't look the best, but the other two formats have an annoying constraint: the row index of a dataframe has to be the default 0 to N, while it can be anything with hdf5. This comes in handy when you want to carry information with the index.
  15. Lastly I want to touch on the trade-off between a better model and a quicker model. This may not seem important for a small dataset, but it has a big impact on your work when the dataset is bigger. The key reason is that repeatedly trying different features, algorithms, and models is inevitable. If each iteration takes too much time, you won't meet the deadline for a competition or project. In most cases, using less data to train a less performant model and gaining insights from there is beneficial at the start of the project.
  16. That’s all I want to say. I hope you can take home some hands-on experience in managing memory and time more efficiently.
  17. Thank you for your attention! And again, feel free to contact me through any of these channels.
  18. One last thing: Servian is hiring! Talk to me if you are looking.