© Cloudera, Inc. All rights reserved.
DISCOVERING EXOPLANETS
WITH DEEP LEARNING & CDSW
Rafael Arana – Senior Solutions Architect
“WE ARE JUST AN ADVANCED BREED OF
MONKEYS ON A MINOR PLANET OF A
VERY AVERAGE STAR.
BUT WE CAN UNDERSTAND THE
UNIVERSE. THAT MAKES US SOMETHING
VERY SPECIAL”
Stephen Hawking
DISCLAIMER #1
https://github.com/google-research/exoplanet-ml
DISCLAIMER #2
I’m not a Data Scientist
KEPLER
NASA’s First Mission Capable of Finding Earth-Size Planets
KEPLER DATA SET
150K Stars
35K Possible planetary signals
3735 Confirmed planets
614 Multi Planet systems
THE KEPLER DATA SET
Threshold Crossing Events (TCEs)
DATA PREPARATION & FEATURE ENGINEERING
NORMALIZE YOUR INPUT DATA
After dividing the flux by the median per segment
• Divide the flux by the median per segment
• Normalize to 1
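As a rough illustration of the normalization step above, the sketch below divides each light-curve segment by its own median so that the baseline flux sits at 1. The function name `normalize_segments` and the sample flux values are made up for this example; a real pipeline would more likely work on NumPy arrays.

```python
from statistics import median


def normalize_segments(segments):
    """Divide every flux value by the median of its own segment."""
    normalized = []
    for flux in segments:
        seg_median = median(flux)
        normalized.append([value / seg_median for value in flux])
    return normalized


# Two hypothetical segments with different baseline brightness levels.
segments = [
    [100.0, 102.0, 98.0, 100.0, 90.0],    # one transit-like dip at the end
    [200.0, 204.0, 196.0, 200.0, 200.0],
]
normalized = normalize_segments(segments)
```

After this step both segments share a baseline of 1, so a dip reads as a fractional drop in brightness regardless of how bright the segment was originally.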
SCRUBBING
“Fix” bad examples by removing them from the data set
• So far we've assumed that all the data used for training and testing was trustworthy.
• In real life, many examples in data sets are unreliable due to one or more of
the following:
• Omitted values.
• Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
• Bad labels. For instance, an astronomer mislabeled an event as a planet.
• Bad feature values. For example, someone typed in an extra digit.
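A minimal sketch of scrubbing, assuming each example is a plain dictionary with hypothetical `id`, `flux` and `label` keys. It only handles the two mechanical cases from the list above (omitted values and duplicates); bad labels and bad feature values usually need domain-specific checks that are not shown here.

```python
import math


def scrub(examples):
    """Drop examples with omitted values and exact duplicates."""
    seen = set()
    clean = []
    for example in examples:
        flux = example["flux"]
        # Omitted values: a missing label or a missing/NaN flux entry.
        if example["label"] is None or any(
            value is None or math.isnan(value) for value in flux
        ):
            continue
        # Duplicate examples: keep only the first copy of each record.
        key = (example["id"], tuple(flux), example["label"])
        if key in seen:
            continue
        seen.add(key)
        clean.append(example)
    return clean


# Hypothetical records, invented for this example.
examples = [
    {"id": 1, "flux": [1.0, 0.99, 1.0], "label": "planet"},
    {"id": 1, "flux": [1.0, 0.99, 1.0], "label": "planet"},        # duplicate
    {"id": 2, "flux": [1.0, float("nan"), 1.0], "label": "other"},  # omitted value
    {"id": 3, "flux": [1.0, 1.0, 1.0], "label": None},              # omitted label
    {"id": 4, "flux": [1.0, 1.01, 1.0], "label": "other"},
]
clean = scrub(examples)
```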
OUTLIER DATA POINTS
Remove all points that deviate by more than 3 standard deviations
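The rule above can be sketched as a simple sigma clip. This version measures the deviation from the mean of the flux values, which is an assumption made for brevity; the slide does not say what the deviation is measured against, and a pipeline could just as well measure it against a fitted curve. The name `remove_outliers` and the sample data are illustrative.

```python
from statistics import mean, pstdev


def remove_outliers(flux, n_sigma=3.0):
    """Keep only points within n_sigma standard deviations of the mean."""
    center = mean(flux)
    spread = pstdev(flux)
    return [value for value in flux if abs(value - center) <= n_sigma * spread]


# Twenty well-behaved points plus one spike.
flux = [1.0] * 20 + [5.0]
cleaned = remove_outliers(flux)
```

A single pass is shown; with several strong outliers the standard deviation is itself inflated, so the clip is often repeated until no more points are removed.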
BINNING, FOLD & SPLINE
Removing the noise
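To make folding and binning concrete, here is a small sketch under stated assumptions: the period and the transit time `t0` are already known, and each bin is summarized by its median. The function names and the toy light curve are invented for the example, and the spline step is not covered.

```python
from statistics import median


def phase_fold(times, period, t0):
    """Map each timestamp to a phase in [-period/2, period/2) centred on t0."""
    half = period / 2.0
    return [((t - t0 + half) % period) - half for t in times]


def bin_medians(phases, flux, period, num_bins):
    """Median flux in equal-width phase bins; empty bins are returned as None."""
    width = period / num_bins
    buckets = [[] for _ in range(num_bins)]
    for phase, value in zip(phases, flux):
        index = min(int((phase + period / 2.0) / width), num_bins - 1)
        buckets[index].append(value)
    return [median(bucket) if bucket else None for bucket in buckets]


# Toy light curve: a dip to 0.9 every 4 time units, first seen at t = 1.
times = [float(t) for t in range(12)]
flux = [0.9 if t % 4 == 1 else 1.0 for t in range(12)]
phases = phase_fold(times, period=4.0, t0=1.0)
binned = bin_medians(phases, flux, period=4.0, num_bins=4)
```

Folding stacks every transit on top of the others at phase 0, and binning then averages out point-to-point noise while keeping the shape of the dip.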
SPLITTING THE DATA SET
Training, test and validation
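A minimal sketch of a three-way split. The 80/10/10 proportions, the fixed seed and the name `split_dataset` are assumptions for illustration; the slides do not state the fractions used.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once, then cut into training, validation and test subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in for 100 labelled examples.
train, val, test = split_dataset(range(100))
```

The validation set is used for tuning decisions during development, while the test set is held back for the final performance estimate.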
STANDARD TENSORFLOW FILE FORMAT
TFRecords
• The recommended format for TensorFlow is a TFRecords file containing
tf.train.Example protocol buffers (which contain Features as a field).
• Optimized for large datasets
• Reads into memory only the data required for each batch
MODELING
MODELING
1st approach - Fully connected neural network (FC)
MODELING
2nd approach - Convolutional neural network (CNN)
MODELING
Combining the two sets of input features
ARCHITECTURE
• Naming conventions:
• Convolutional layers
• conv[kernel size]-[number of feature maps]
• Max pooling layers
• maxpool[window length]-[stride length]
• Fully connected layers
• FC-[number of units]
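To show how the naming convention reads in practice, the sketch below parses layer names into small dictionaries. It assumes the notation is conv[kernel size]-[number of feature maps], maxpool[window length]-[stride length] and FC-[number of units]; `parse_layer` is a hypothetical helper and the sizes in the sample names are illustrative, not taken from the slides.

```python
import re

# Hypothetical helper: turns a layer name in the slide's notation into a small spec.
_LAYER_PATTERNS = [
    ("conv", re.compile(r"^conv(\d+)-(\d+)$"), ("kernel_size", "feature_maps")),
    ("maxpool", re.compile(r"^maxpool(\d+)-(\d+)$"), ("window_length", "stride_length")),
    ("fc", re.compile(r"^FC-(\d+)$"), ("units",)),
]


def parse_layer(name):
    """Return a dict describing the layer, or raise ValueError if the name is unknown."""
    for layer_type, pattern, fields in _LAYER_PATTERNS:
        match = pattern.match(name)
        if match:
            spec = {"type": layer_type}
            for field, value in zip(fields, match.groups()):
                spec[field] = int(value)
            return spec
    raise ValueError(f"unrecognised layer name: {name!r}")


# Illustrative names only.
stack = [parse_layer(name) for name in ["conv5-16", "maxpool5-2", "FC-512"]]
```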
EVALUATION – NETWORK PERFORMANCE
MODEL ANALYSIS
Metrics to assess our model’s performance.
• Precision: the fraction of signals classified as planets that are true planets
(also known as reliability; see, e.g., S. E. Thompson et al. 2017, in preparation).
• Recall: the fraction of true planets that are classified as planets (also known
as completeness).
• Accuracy: the fraction of correct classifications.
• AUC: the area under the receiver operating characteristic curve, which is
equivalent to the probability that a randomly selected planet is scored higher
than a randomly selected false positive.
METRICS
Assessing model performance
• Precision:
• the fraction of signals classified as planets that are
true planets (also known as reliability).
• Recall:
• the fraction of true planets that are classified as
planets (also known as completeness).
• Accuracy:
• the fraction of correct classifications.
• AUC:
• the area under the receiver operating characteristic
curve, which is equivalent to the probability that a
randomly selected planet is scored higher than a
randomly selected false positive.
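The four definitions above translate almost directly into code. The sketch below is illustrative: the 0.5 decision threshold, the function names and the sample scores are assumptions, and AUC is computed straight from its pairwise definition (ties count as half), which is fine for small inputs but slow for large ones.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Return (precision, recall, accuracy) for binary labels and model scores."""
    predictions = [score >= threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(labels)
    return precision, recall, accuracy


def auc(labels, scores):
    """Probability that a random planet outscores a random false positive."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = planet, 0 = false positive.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
precision, recall, accuracy = classification_metrics(labels, scores)
area = auc(labels, scores)
```

Precision, recall and accuracy depend on the chosen threshold, whereas AUC summarizes how well the scores rank planets above false positives across all thresholds.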
TENSORBOARD
Assessing model performance
• From a Terminal run:
• tensorboard --port 8080 --logdir /home/cdsw/model_checkpoint
TENSORFLOW CHECKPOINTS
• Critical when you start training larger models
• They allow you to continue training, resume on failure, and predict from a trained
model.
• Save: specify a folder when you instantiate the model, and checkpoints will be
saved there periodically.
• Restore: specify the same folder when you instantiate the model; if a checkpoint is found there,
it is loaded and the estimator is ready for predictions.
• If you want to restart from scratch, just delete this folder.
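The save, restore and restart behaviour described above can be imitated with a toy training loop. This is not TensorFlow's checkpoint mechanism: it is a standard-library sketch in which the JSON file name, the `train` function and the fake "weight" update are all invented to show the control flow.

```python
import json
import os
import shutil
import tempfile


def train(model_dir, total_steps, save_every=10):
    """Toy training loop that resumes from a checkpoint in model_dir if one exists."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "checkpoint.json")
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(path):
        # Restore: a checkpoint was found, so continue from where it stopped.
        with open(path) as handle:
            state = json.load(handle)
    while state["step"] < total_steps:
        state["step"] += 1
        state["weight"] += 0.5  # stand-in for a real parameter update
        if state["step"] % save_every == 0:
            # Save: write the current state periodically.
            with open(path, "w") as handle:
                json.dump(state, handle)
    return state


model_dir = os.path.join(tempfile.mkdtemp(), "model_checkpoint")
first = train(model_dir, total_steps=20)    # trains steps 1..20 and saves
resumed = train(model_dir, total_steps=30)  # restores step 20, trains 21..30
shutil.rmtree(model_dir)                    # delete the folder to start over
fresh = train(model_dir, total_steps=10)    # starts again from step 0
```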
NETWORK OPTIMIZATION
OPTIMIZATION TECHNIQUES
• Adam optimization algorithm
• minimize the cross-entropy error function over the training set
• Data augmentation
• We augmented our training data by applying random horizontal reflections to the light
curves during training
• Dropout regularization on the fully connected layers
• helps prevent overfitting by randomly “dropping” some of the output neurons from
each layer during training, so that the model does not become overly reliant on any of its
features
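The data augmentation step above, a random horizontal reflection, amounts to reversing the light curve in time for a random subset of training examples. The sketch below assumes a flip probability of 0.5 and uses a hypothetical `augment` function; neither detail comes from the slides.

```python
import random


def augment(light_curve, rng, flip_probability=0.5):
    """Return the light curve reversed in time with the given probability."""
    if rng.random() < flip_probability:
        return light_curve[::-1]
    return list(light_curve)


rng = random.Random(0)
curve = [1.0, 1.0, 0.97, 0.95, 0.99, 1.0]  # made-up asymmetric dip
augmented_batch = [augment(curve, rng) for _ in range(8)]
always_flipped = augment(curve, rng, flip_probability=1.0)
never_flipped = augment(curve, rng, flip_probability=0.0)
```

Because a reflected transit is still a plausible transit, the label is unchanged and the model sees more variety without any new observations.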
TENSORFLOW OPTIMIZERS
• Implementations
• tf.train.MomentumOptimizer(learning_rate, momentum, use_nesterov)
• tf.train.GradientDescentOptimizer(learning_rate)
• tf.train.AdagradOptimizer(learning_rate)
• tf.train.AdamOptimizer(learning_rate)
• tf.train.RMSPropOptimizer(learning_rate)
• TPU:
• tf.contrib.tpu.CrossShardOptimizer(optimizer)
• Clip Gradient Norms:
• tf.contrib.training.clip_gradient_norms_fn(max_norm)
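To give a feel for what one of these optimizers does at each step, here is a plain-Python sketch of the Adam update rule applied to a single scalar parameter. It is not the TensorFlow implementation: `adam_minimize`, the toy quadratic objective and the learning rate of 0.1 are assumptions chosen for the example, with the commonly used default values for the two decay rates and epsilon.

```python
import math


def adam_minimize(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using Adam updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

In a real model the same update is applied element-wise to every weight, with the gradient coming from the cross-entropy loss rather than a hand-written formula.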
DROPOUT
Techniques to prevent overfitting
DROPOUT
• Adjust dropout per layer
• Initial layers normally have more hidden units
• The more hidden units you have, the higher the risk of overfitting
• Apply more dropout to those layers
• Can be applied to the conv layers or the FC layers
• Don’t use it during evaluation (test set)
• We want predictable outputs
• Downside:
• The noise from dropout means the cost function (J) no longer decreases at every step.
• Health check: disable dropout, check that J decreases steadily, then enable dropout again
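The difference between training and evaluation can be sketched with "inverted" dropout, a common formulation in which surviving activations are scaled up during training so that nothing needs to change at evaluation time. The function below is a standard-library illustration with invented names, not the TensorFlow layer, and it assumes a drop probability strictly below 1.

```python
import random


def dropout(activations, drop_probability, training, rng):
    """Inverted dropout: zero units at random in training, pass through otherwise."""
    if not training or drop_probability == 0.0:
        return list(activations)
    keep_probability = 1.0 - drop_probability
    return [
        value / keep_probability if rng.random() < keep_probability else 0.0
        for value in activations
    ]


rng = random.Random(0)
activations = [1.0] * 8
train_out = dropout(activations, drop_probability=0.5, training=True, rng=rng)
eval_out = dropout(activations, drop_probability=0.5, training=False, rng=rng)
```

During training each unit is either zeroed or scaled by 1 / keep_probability, which is the source of the noise in the cost function; at evaluation the activations pass through unchanged, so predictions are repeatable.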
GOOGLE-VIZIER
A Google-internal service for black-box optimization
• Automatically tune the hyperparameters
• input representations (e.g., number of bins, bin width)
• model architecture (e.g., number of fully connected layers, number
of convolutional layers, kernel size)
• and training (e.g., dropout probability).
• Each Vizier “study” trained several thousand models to find the
best hyperparameters
• Each model was trained on a single central processing unit (CPU)
• Used 100 CPUs per study to train individual models in parallel
KEPLER 90
The star known to host the most planets
https://www.nasa.gov/image-feature/ames/kepler-90-system-planet-sizes
DEMO TIME
THANK YOU
