This document discusses using Python to analyze LHC data from CERN's Higgs Boson machine learning challenge on Kaggle. It introduces ROOT and TMVA tools for working with particle physics data in C++ and Python. It also discusses SciPy and Scikit-Learn libraries that can be used for tasks like data preprocessing, machine learning algorithms, and visualization.
19. Big News on 2012/07/04
Discovery of a New Boson
with Mass ~125 GeV
CERN-HI-1207136_92
30. Unsupervised Learning
The Google Cat
@ ICML'12
Deep Learning
Trained on 16K cores
Done in 3 days
Over 10M YouTube stills
http://arxiv.org/abs/1112.6209
37. Examining the Kaggle challenge data
Data files provided on the Kaggle website:
Training dataset:
In CSV format
250000 events
ID + 30 features
Weighted events!
Class label: s, b
Test dataset:
550000 events
Same format
random_submission:
Sample for evaluation
AMS metric:
Python script for competition evaluation metric
https://www.kaggle.com/c/higgs-boson/data
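As a rough illustration of the AMS metric and the weighted events mentioned on this slide, the sketch below sums the weights of selected signal and background events and evaluates the approximate median significance, AMS = sqrt(2((s + b + b_reg) ln(1 + s/(b + b_reg)) - s)) with b_reg = 10, as defined for the challenge. The column names EventId, Weight and Label follow the layout of the training CSV; the four sample rows, the helper `weighted_counts` and the selected IDs are made up for illustration and are not taken from the slides.

```python
import csv
import io
import math


def ams(s, b, b_reg=10.0):
    """Approximate median significance for weighted signal s and background b."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


def weighted_counts(rows, selected_ids):
    """Sum the weights of the selected events, split by true class label."""
    s = b = 0.0
    for row in rows:
        if row["EventId"] in selected_ids:
            weight = float(row["Weight"])
            if row["Label"] == "s":
                s += weight
            else:
                b += weight
    return s, b


# Made-up rows in the same layout as the training file (only one feature shown).
SAMPLE = """EventId,DER_mass_MMC,Weight,Label
100000,138.47,0.5,s
100001,160.94,2.0,b
100002,-999.0,1.5,b
100003,143.9,0.25,s
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Hypothetical classifier output: the event IDs it tagged as signal.
s, b = weighted_counts(rows, {"100000", "100001", "100003"})
score = ams(s, b)
print(f"s={s}, b={b}, AMS={score:.4f}")
```

To score a real submission, the same two functions could be fed rows read from the downloaded training file with `csv.DictReader(open(...))` instead of the in-memory sample.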
38. ROOT
ROOT Object-Oriented Toolkit
Data analysis tool
Written in C++ (millions of lines)
Open source
Integrated C++ interpreter
File formats
I/O handling, graphics, plotting, math, histogram binning, event display, geometric navigation
Powerful fitting (RooFit) and statistical (RooStats) packages
In use by most HEP experiments
Standard tool for producing physics results at the LHC
New tools for model creation and combinations
http://root.cern.ch/drupal/
39. pyROOT
ROOT Object-Oriented Toolkit
Python binding for ROOT
Even if you are not used to C, no problem!
All the booking and plotting functions have corresponding Python bindings
You can also use the same data structures as in C++
http://root.cern.ch/drupal/
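A minimal sketch of what booking, filling and plotting a histogram through the Python bindings can look like. It assumes a ROOT installation built with Python support; the import is guarded so the snippet does nothing when ROOT is missing. The histogram name, title, binning, fill values and output file name are arbitrary illustrative choices, not taken from the slides.

```python
try:
    import ROOT  # PyROOT; only importable when ROOT was installed with Python support
except ImportError:
    ROOT = None


def book_and_fill(values, weights, name="h_mass"):
    """Book a 1D histogram and fill it with weighted values (None without ROOT)."""
    if ROOT is None:
        return None
    ROOT.gROOT.SetBatch(True)  # run without opening GUI windows
    # Same constructor arguments as in C++: name, title, bins, lower edge, upper edge.
    hist = ROOT.TH1F(name, "Illustrative mass;mass [GeV];weighted events", 50, 0.0, 250.0)
    for value, weight in zip(values, weights):
        hist.Fill(value, weight)
    return hist


hist = book_and_fill([120.5, 125.1, 130.2], [1.0, 0.5, 2.0])
if hist is not None:
    canvas = ROOT.TCanvas("c", "c", 800, 600)
    hist.Draw()
    canvas.SaveAs("h_mass.png")  # write the plot to an image file
```

The calls mirror their C++ counterparts one to one, which is the point made on the slide: `TH1F`, `Fill`, `Draw` and `TCanvas` are used from Python exactly as they would be in a ROOT macro.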
40. TMVA
Multi-variate analysis tool-kit
Based on supervised learning
Embedded in ROOT
Easy training and testing
Providing various classifiers
Linear Discriminant (LD)
Artificial Neural Networks (NN)
Boosted Decision Trees (BDT)
...
http://tmva.sourceforge.net/
41. pyTMVA
Multi-variate analysis tool-kit
It works with Python too!
Providing various classifiers
Linear Discriminant (LD)
Artificial Neural Networks (NN)
Boosted Decision Trees (BDT)
...
http://tmva.sourceforge.net/
57. “Big data is like teenage sex: everyone talks
about it, nobody really knows how to do it,
everyone thinks everyone else is doing it, so
everyone claims they are doing it...”
- Dan Ariely (Duke)
58. Installing ROOT
Get the ROOT binary for Ubuntu
Go here:
http://sourceforge.net/projects/cernrootdebs/
Download the i386/x86_64 package:
Click on "Files" → "32bits!" → "root_5.32.00_i386.deb"
Open a terminal
Type in the following commands:
$ cd Download/
$ sudo dpkg -i root_5.32.00_i386.deb ← use your password
$ sudo apt-get install libssl0.9.8
$ sudo apt-get install libjpeg62
$ source /opt/root/bin/thisroot.sh ← you can put this in ~/.bashrc
You can run root now:
$ root -l ← "-l" means no splash window
root [0] TBrowser t ← make sure there are no error messages