A talk about organizing the machine learning process in practice. Both conceptual and technical aspects are discussed, along with an introduction to the Luigi framework and a short story about fitting neural networks at Flo, a top-level mobile tracker of women's health.
2. What is this speech about?
1. Data mining / Machine learning process
2. Workflow automation
3. Basic design concepts
4. Data pipelines
5. Available instruments
6. About my own experience
3. Process overview
1. Data Engineering – 80%
– Data extraction
– Data cleaning
– Data transformation
– Data normalization
– Feature extraction
2. Machine Learning – 20%
– Model fitting
– Hyperparameter tuning
– Model evaluation
CRISP-DM
4. Why automation?
1. You want to update models on a regular basis
2. Make your data workflows more trustworthy
3. You can perform a data freeze (possibly)
4. A step towards (more) reproducible experiments
5. Write once and enjoy every day
5. How: Conceptual requirements
1. Reuse code between training and evaluation
phases (as much as possible)
2. It's easier to log features than to extract them
from data retrospectively (if you can)
3. A solid environment is more important for the
first iteration than the quality of your model
4. Better to use the same language everywhere
(integration becomes much easier)
5. Every model requires support after deployment
6. You’d better know the rules of the game…
6. Feel free to download it from the author's personal web page:
http://martin.zinkevich.org/rules_of_ml/
8. How: Technical requirements
1. Simple way to define DAGs of batch tasks
2. Task parameterization
3. Ability to store intermediate results
(checkpointing)
4. Task dependency resolution
5. Automatic failure handling
6. Logging, notifications
7. Execution state monitoring
8. Python-based solution (we are at PyCon)
9. https://github.com/pinterest/pinball
Pinball (Pinterest)
1. Nice UI
2. Dynamic pipeline
generation
3. Pipelines configuration in
Python code (?)
4. Parameterization through
shipping Python dicts (?)
5. Effectively undocumented
6. Seemingly no other big
players use it
10. https://github.com/apache/incubator-airflow
Airflow (AirBnB, Apache Incubator)
1. Very nice UI
2. Dynamic pipeline
generation
3. Orchestration through
message queue
4. Code shipping
5. Scheduler spawns workers
6. Pipelines configuration in
Python code
7. Parameterization through
task templates using Jinja
(Hmm…)
8. In my opinion, not as elegant
as the documentation suggests
11. https://github.com/spotify/luigi
Luigi (Spotify, Foursquare)
1. Simple UI
2. Dynamic pipeline
generation
3. Orchestration through
central scheduling (no
external components)
4. No code shipping
5. No scheduler
6. Pipelines configuration in
Python code (very elegant!)
7. Parameterization through
Parameter objects
8. Simple, well-tested
9. Good documentation
13. Luigi …
… is a Python module that helps you build complex pipelines
of batch jobs. It handles dependency resolution, workflow
management, visualization etc. It also comes with Hadoop
support built in.
… helps you stitch many tasks together, where each task can
be a Hive query, a Hadoop job in Java, a Spark job in Scala or
Python, a Python snippet, dumping a table from a database,
or anything else…
14. Luigi facts
1. Inspired by GNU Make
2. Everything in Luigi is in Python
3. Extremely simple (has only three main
classes: Target, Task, Parameter)
4. Each task must consume some input data
and may produce some output
5. Based on assumption of atomic writes
15. Luigi facts
1. Has no built-in scheduler (use crontab / run
manually from CLI)
2. You cannot trigger tasks from the UI (it is only
for monitoring purposes)
3. Master takes only orchestration role
4. Master does not ship your code to workers
16. Luigi fundamentals
Target corresponds to:
• file on local FS
• file on HDFS
• entry in DB
• any other kind of a checkpoint
Task:
• this is where execution takes place
• consume Targets that were created by other Tasks
• usually also produce a Target
• could depend on one or more other Tasks
• could have Parameters
17. Luigi Targets
• Have to implement exists method
• Write must be atomic
• Luigi comes with a toolbox of useful Targets:
luigi.LocalTarget('/home/path/to/some/file')
luigi.contrib.hdfs.HdfsTarget('/reports/%Y-%m-%d')
luigi.postgres.PostgresTarget(…)
luigi.contrib.mysqldb.MySqlTarget(…)
luigi.contrib.ftp.RemoteTarget(…)
… and many others …
• Built-in formats (GzipFormat is useful)
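The Target contract is small enough to sketch in plain Python (Luigi itself is not used here; `FileTarget` is an illustrative stand-in for `luigi.LocalTarget`). The atomic write, done by writing to a temporary file and then renaming it, is what makes `exists()` a safe completion marker:

```python
import os
import tempfile

class FileTarget:
    """Minimal sketch of Luigi's Target contract: an exists() check
    plus an atomic write (write to a temp file, then rename)."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

    def write_atomic(self, data):
        # Write to a temp file in the same directory, then atomically
        # rename it into place, so a crash never leaves a partial file.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, self.path)

workdir = tempfile.mkdtemp()
target = FileTarget(os.path.join(workdir, "report.txt"))
target.write_atomic("cycle stats\n")
```

If the process crashes mid-write, the final path never appears, so the corresponding Task simply reruns on the next invocation.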
18. Luigi Tasks
• Main methods: run(), output(), requires()
• Write your code in run()
• Define your Target in output()
• Define dependencies using requires()
• Task is complete() if output Target exists()
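A toy sketch of this contract in plain Python (not Luigi's actual implementation; `MemoryTarget` and the `build()` driver are invented for illustration): a Task is complete when its output Target exists, and the driver recursively satisfies requirements before running the task itself:

```python
class Task:
    def requires(self):   # upstream Tasks
        return []
    def output(self):     # Target produced by this Task
        raise NotImplementedError
    def run(self):        # the actual work
        raise NotImplementedError
    def complete(self):   # Luigi's rule: complete iff output exists
        return self.output().exists()

class MemoryTarget:
    """In-memory checkpoint, standing in for a file or DB entry."""
    store = {}
    def __init__(self, key):
        self.key = key
    def exists(self):
        return self.key in self.store
    def put(self, value):
        self.store[self.key] = value

class ExtractData(Task):
    def output(self):
        return MemoryTarget("raw")
    def run(self):
        self.output().put([1, 2, 3])

class Features(Task):
    def requires(self):
        return [ExtractData()]
    def output(self):
        return MemoryTarget("features")
    def run(self):
        raw = MemoryTarget.store["raw"]
        self.output().put([x * 10 for x in raw])

def build(task):
    """Recursively run requirements first, skipping complete tasks."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

build(Features())
```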
19. Luigi Parameters
• How do you handle a Task that runs a Hadoop job every night? Parameterize it by date.
• Luigi provides a lot of them:
luigi.parameter.Parameter
luigi.parameter.DateParameter
luigi.parameter.IntParameter
luigi.parameter.EnumParameter
luigi.parameter.ListParameter
luigi.parameter.DictParameter
… and others …
• And values are automatically parsed from the CLI!
20. Execute from CLI: $ luigi MyTask --module your.cool.module --param 999
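A simplified stand-in for how such Parameter values get parsed from the command line (real Luigi also reads config files and attaches parameters to Task classes; the `parse_cli` helper below is invented for illustration):

```python
import datetime

class Parameter:
    """Base parameter: keeps the raw string."""
    def parse(self, s):
        return s

class IntParameter(Parameter):
    def parse(self, s):
        return int(s)

class DateParameter(Parameter):
    def parse(self, s):
        return datetime.date.fromisoformat(s)

def parse_cli(declared, argv):
    """Map --name value pairs onto declared parameters."""
    values = {}
    it = iter(argv)
    for token in it:
        name = token.lstrip("-").replace("-", "_")
        values[name] = declared[name].parse(next(it))
    return values

declared = {"param": IntParameter(), "date": DateParameter()}
values = parse_cli(declared, ["--param", "999", "--date", "2017-07-16"])
```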
21. Central scheduling
• Luigi central scheduler (luigid)
– Doesn’t do any data processing
– Doesn’t execute any tasks
– Worker synchronization
– Task dependency resolution
– Prevents the same task from running multiple times
– Provides administrative web interface
– Retries in case of failures
– Sends notifications (emails only)
• Luigi worker (luigi)
– Starts via cron / by hand
– Connects to central scheduler
– Defines tasks for execution
– Waits for permission to execute Task.run()
– Processes data, populates Targets
23. Execution model
Simplified process:
1. Some workers are started
2. Each submits its DAG of Tasks
3. Recursive check of Task completion
4. Worker receives Task to execute
5. Data processing!
6. Repeat
Client-server API:
1. add_task(task_id, worker_id, status)
2. get_work(worker_id)
3. ping(worker_id)
http://www.arashrouhani.com/luigid-basics-jun-2015/
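The three client-server calls can be mimicked with an in-memory toy scheduler (a model of luigid's bookkeeping, not its real implementation): it tracks statuses and hands out runnable work, but never executes anything itself:

```python
import time

class ToyScheduler:
    """In-memory stand-in for luigid: tracks task status and hands out
    runnable work; execution stays on the workers."""

    def __init__(self):
        self.tasks = {}      # task_id -> {"status": ..., "deps": [...]}
        self.last_ping = {}

    def add_task(self, task_id, worker_id, status, deps=()):
        self.tasks[task_id] = {"status": status, "deps": list(deps)}

    def get_work(self, worker_id):
        # Hand out a PENDING task whose dependencies are all DONE.
        for task_id, info in self.tasks.items():
            if info["status"] == "PENDING" and all(
                self.tasks[d]["status"] == "DONE" for d in info["deps"]
            ):
                info["status"] = "RUNNING"
                return task_id
        return None

    def ping(self, worker_id):
        self.last_ping[worker_id] = time.time()

sched = ToyScheduler()
sched.add_task("extract", "w1", "PENDING")
sched.add_task("fit", "w1", "PENDING", deps=["extract"])
first = sched.get_work("w1")            # only "extract" is runnable
sched.tasks["extract"]["status"] = "DONE"
second = sched.get_work("w1")           # now "fit" becomes runnable
```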
25. Easy parallelization recipe
1. Do not use multiprocessing inside a Task
2. Split a huge Task into smaller ones and yield
them inside the run() method
3. Run luigi with the --workers N parameter
4. Make a separate job to combine all the
Targets (if you want)
5. This also helps minimize possible data
loss in case of failures (atomic writes)
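The yield-inside-run() trick can be sketched with a generator. The `toy_worker` below is an invented stand-in for a Luigi worker: it receives the yielded subtasks, runs them, then resumes the parent task (with `--workers N`, real Luigi would run the yielded batch in parallel):

```python
results = {}

class ChunkTask:
    """A small task: sums one 100-element slice of the data."""
    def __init__(self, chunk):
        self.chunk = chunk
    def run(self):
        start = self.chunk * 100
        results[self.chunk] = sum(range(start, start + 100))

class BigTask:
    """Instead of multiprocessing inside run(), yield smaller tasks."""
    def run(self):
        subtasks = [ChunkTask(i) for i in range(4)]
        yield subtasks                     # the worker runs these first
        # resumed only after all subtasks are done:
        results["total"] = sum(results[i] for i in range(4))

def toy_worker(task):
    gen = task.run()
    for batch in gen:          # each yielded batch is a list of tasks
        for sub in batch:      # a worker pool could run these in parallel
            sub.run()

toy_worker(BigTask())
```

Because each chunk writes its own Target, a crash loses at most one chunk's worth of work, which is the data-loss point made above.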
26. Luigi notifications
• luigi.notifications
• Built-in support for email notifications:
– SMTP
– Sendgrid
– Amazon SES / Amazon SNS
• Side projects for other channels:
– Slack (https://github.com/bonzanini/luigi-slack)
– …
28. Flo is the first period & ovulation tracker that uses neural networks*.
* OWHEALTH, INC. is the first company to publicly announce using neural networks for
menstrual cycle analysis and prediction.
29. • Top-ranked app in the Apple App Store and Google Play
• More than 6.5 million registered users
• More than 17.5 million tracked cycles
• Integration with wearable devices
• A lot of (partially) structured information
• Quite a lot of work with data & machine learning
• And even more!
30. • About 450 GB of useful information:
– Cycles lengths history
– Ovulation and pregnancy tests results
– User profile data (Age, Height, Weight, …)
– Manually tracked events (Symptoms, Mood, …)
– Lifestyle statistics (Sleep, Activity, Nutrition, …)
– Biometrics data (Heart rate, Basal temperature, …)
– Textual data
– …
• Periodic model updates
31. Key points
• Base class for all models (sklearn-like interface)
• Shared code base for data and feature extraction
during training and prediction phases
• Currently 450+ features extracted for each cycle
• Using individual-level submodels predictions (weak
predictors) as features for network input (strong
predictor)
• Semi-automatic model updates
• Model unit testing before deployment
• In practice, heuristics are combined with machine
learning
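The "weak predictors as features" idea might look roughly like this; every name and number below is invented for illustration (the real Flo code base is not public), and a trivial mean stands in for the neural network:

```python
class BaseModel:
    """sklearn-like interface shared by all models."""
    def fit(self, X, y):
        raise NotImplementedError
    def predict(self, X):
        raise NotImplementedError

class MeanCycleModel(BaseModel):
    """Weak predictor: predicts the mean observed cycle length."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean_ for _ in X]

class StackedModel(BaseModel):
    """Strong predictor: appends weak predictions as input features."""
    def __init__(self, weak_models):
        self.weak_models = weak_models
    def fit(self, X, y):
        X_aug = self._augment(X)
        # ... here a neural network would be fitted on X_aug ...
        self.fallback_ = sum(y) / len(y)   # toy stand-in for the network
        return self
    def predict(self, X):
        return [self.fallback_ for _ in self._augment(X)]
    def _augment(self, X):
        weak_preds = [m.predict(X) for m in self.weak_models]
        return [row + [p[i] for p in weak_preds] for i, row in enumerate(X)]

X, y = [[25], [31], [29]], [28, 29, 28]
weak = MeanCycleModel().fit(X, y)
model = StackedModel([weak]).fit(X, y)
preds = model.predict(X)
```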
32. Model update in Flo =
• (Me) Trigger pipeline execution from CLI
• (Luigi) Executes ETL tasks (on live Postgres replica)
• (Luigi) Persists raw data on disk (data freeze)
• (Luigi) Executes feature extraction tasks
• (Luigi) Persists dataset on disk
• (Luigi) Executes Neural Network fitting task
• (Tensorflow) A lot of operations with tensors
• (Me) Monitoring with TensorBoard and Luigi Web Interface
• (Me) Working on other tasks, reading Slack notifications
• (Me) Deploying model by hand (after unit testing)
• (Luigi, Me) Looking after model accuracy in production
33. Triggering pipeline
1. Class of model:
• Provides basic architecture of network
• Has predefined set of hyperparameters
2. Model build parameters:
• Sizes of some named layers
• Weights decay amount (L2 regularization technique)
• Dropout regularization amount
• Or whatever else is needed to compile the Tensorflow / Theano computation graph
3. Model fit parameters:
• Number of fitting epochs
• Mini-batch size
• Learning rate
• Specific paths to store intermediate results
4. Data extraction parameters:
• Date of data freeze (raw data on disk)
• Segment of users for which we want to fit model
• Many others (default values used)
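Grouped this way, the parameters could be carried as plain dataclasses feeding the pipeline trigger (all names and defaults below are illustrative, not Flo's actual configuration):

```python
from dataclasses import dataclass, field
import datetime

@dataclass
class BuildParams:
    hidden_size: int = 128          # size of a named layer
    weight_decay: float = 1e-4      # L2 regularization amount
    dropout: float = 0.5            # dropout regularization amount

@dataclass
class FitParams:
    epochs: int = 50
    batch_size: int = 256
    learning_rate: float = 1e-3
    checkpoint_dir: str = "checkpoints/"

@dataclass
class DataParams:
    freeze_date: datetime.date = datetime.date(2017, 7, 1)
    user_segment: str = "all"

@dataclass
class PipelineParams:
    build: BuildParams = field(default_factory=BuildParams)
    fit: FitParams = field(default_factory=FitParams)
    data: DataParams = field(default_factory=DataParams)

# override just what differs from the defaults
params = PipelineParams(fit=FitParams(epochs=10))
```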
34. Model update in Flo: DAG
Fit network → Extract features → Fit submodels → Extract Raw Data → Train / Test split
35. Model update in Flo: Dashboard
Track DAG execution status in Luigi scheduler web interface:
36. Model update in Flo: Tensorboard
Track model fitting progress in Tensorboard:
37. Model update in Flo: Notifications
• Everything is OK:
• Some trouble with connection:
• Do I need to update the model?
38. Conclusion
Reproducibility and automation are about:
1. Process design (conceptual aspect)
– Think not only about experiments, but about further
integration too
– Known best practices
2. Process realization (technical aspect)
– Build solid data science environment
– Search for convenient instruments (Luigi seems like a
good starting point)
– Make your pipelines simple and easily extensible
– Do everything you can to make your pipelines trustworthy
– Monitoring is an important aspect