This presentation was given by Marios Michailidis (a.k.a. KazAnova), currently ranked #3 on Kaggle, to help the community learn machine learning better. It comprises useful ML tips and techniques for performing better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
1. How to Win Machine Learning Competitions
By Marios Michailidis
It’s not the destination… it’s the journey!
2. What is Kaggle?
• The world’s biggest predictive modelling competition platform
• Half a million members
• Companies host data challenges.
• Usual tasks include:
– Predict topic or sentiment from text.
– Predict species/type from image.
– Predict store/product/area sales
– Marketing response
3. Inspired by Horse races…!
• At the University of Southampton, an
entrepreneur talked to us about how he was
able to predict the horse races with regression!
4. Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• I became more passionate!
• Picked up programming skills
5. Built KazAnova
• Generated a couple of algorithms and data techniques and decided to make them public so that others could gain from them.
• I released it at (http://www.kazanovaforanalytics.com/)
• Named it after ANOVA (statistics) and KAZANI, my mom’s last name.
6. Joined Kaggle!
• Was curious about Kaggle.
• Joined a few contests and learned lots.
• The community was very open to sharing and
collaboration.
7. Other interesting tasks!
• Predict the correct answer from a science test, using Wikipedia.
• Predict which virus has infected a file.
• Predict the NCAA tournament!
• Predict the Higgs Boson.
• Out of many different numerical series, predict which one is the cause.
8. 3 Years of modelling competitions
• Over 90 competitions
• Participated with 45 different teams
• 22 top 10 finishes
• 12 times prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists
9. What's next
• PhD (UCL) on recommender systems.
• More kaggling (but less intense)!
10. So… what wins competitions?
In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration.
• Ensembling
11. More specifically…
● Understand the problem and the function to optimize
● Choose what seems to be a good algorithm for the problem and iterate:
• Clean the data
• Scale the data
• Make feature transformations
• Make feature derivations
• Use cross-validation to:
– Make feature selections
– Tune hyper-parameters of the algorithm
– Do this for many other algorithms, always exploiting their benefits
– Find the best way to combine or ensemble the different algorithms
● Different types of models and when to use them
12. Understand the metric to optimize
● For example, in the competition it was AUC (Area Under the ROC Curve).
● This is a ranking metric.
● It shows how consistently your good cases have higher scores than your bad cases (see the sketch below).
● Not all algorithms optimize the metric you want.
● Common metrics:
– AUC,
– Classification accuracy,
– precision,
– NDCG,
– RMSE
– MAE
– Deviance
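To make the ranking interpretation concrete, here is a minimal Python sketch (not from the original deck; the labels and scores are made up) showing that AUC equals the fraction of (good case, bad case) pairs where the good case gets the higher score:

```python
# AUC as a ranking metric: the probability that a randomly chosen
# positive (good) case scores higher than a randomly chosen negative one.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0])              # made-up labels
scores = np.array([0.9, 0.7, 0.4, 0.6, 0.3, 0.2])  # made-up model scores

# Pairwise definition: fraction of (positive, negative) pairs ranked
# correctly (ties would count as half; ignored here for simplicity).
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(roc_auc_score(y_true, scores), pairwise)  # both print ~0.889
```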
13. Choose the algorithm
● In most cases many algorithms are experimented with before finding the right one(s).
● Those who try more models/parameters have a higher chance of winning contests than others.
● Any algorithm that makes sense to be used, should be
used.
14. In the algorithm: Clean the Data
● The data cleaning step is not independent of the chosen algorithm: for different algorithms, different cleaning filters should be applied.
● Treating missing values is really important; certain algorithms are more forgiving than others.
● On other occasions it may make more sense to carefully replace a missing value with a sensible one (like the average of the feature), treat it as a separate category, or even remove the whole observation (see the sketch below).
● Similarly, we search for outliers; again, some models are more forgiving, while in others the impact of outliers is detrimental.
● How to decide the best method?
– Try them all, or
– Experience and literature, but mostly the first (bold)
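As a concrete illustration of the options above, here is a small pandas sketch (assumed, not from the deck; the DataFrame and column names are invented):

```python
# Three ways to treat a missing value, as listed above.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "city": ["London", None, "Athens"]})

df["age"] = df["age"].fillna(df["age"].mean())  # replace with a sensible value
df["city"] = df["city"].fillna("MISSING")       # treat as a separate category
# df = df.dropna()                              # or remove the whole observation
```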
15. In the algorithm: scaling the data
● Certain algorithms cannot deal with unscaled data.
● Scaling techniques:
– Max scaler: divide each feature by its highest absolute value
– Normalization: subtract the mean and divide by the standard deviation
– Conditional scaling: scale only under certain conditions (e.g. in medicine we tend to scale per subject to make their features comparable)
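A minimal scikit-learn sketch of the first two techniques (the toy matrix is assumed, for illustration only):

```python
# Max scaling vs. normalization (standardization) on a toy feature matrix.
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_max = MaxAbsScaler().fit_transform(X)    # divide each feature by its max |value|
X_std = StandardScaler().fit_transform(X)  # subtract mean, divide by std deviation
```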
16. In the algorithm: feature transformations
● For certain algorithms there is benefit in changing the features, because transformed features help them converge faster and better.
● Common transformations include:
1. LOG, SQRT(variable): smoothens variables
2. Dummies for categorical variables
3. Sparse matrices: to be able to compress the data
4. 1st derivatives: to smoothen data
5. Weights of evidence (transforming variables while using information from the target variable)
6. Unsupervised methods that reduce dimensionality (SVD, PCA, ISOMAP, KD-trees, clustering)
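A short sketch of a few of these transformations (a minimal example with made-up data, not the deck’s own code):

```python
# Log/sqrt smoothing, dummies, and unsupervised dimensionality reduction.
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

df = pd.DataFrame({"income": [1e3, 1e4, 1e5],
                   "color": ["red", "blue", "red"]})

df["log_income"] = np.log1p(df["income"])   # 1. LOG smoothens the variable
df["sqrt_income"] = np.sqrt(df["income"])   # 1. SQRT, likewise
dummies = pd.get_dummies(df["color"])       # 2. dummies for the categorical

# 6. SVD to reduce dimensionality (also works on sparse matrices)
X = np.random.rand(100, 20)
X_reduced = TruncatedSVD(n_components=5).fit_transform(X)
```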
17. In the algorithm: feature derivations
● In many problems this is the most important thing. For example:
– Text classification: generate the corpus of words and make TF-IDF
– Sounds: convert sounds to frequencies through Fourier transformations
– Images: make convolutions, e.g. break an image down to pixels and extract different parts of the image
– Interactions: really important for some models (for our algorithms too!), e.g. have variables that show if an item is popular AND the customer likes it
– Others that make sense: similarity features, dimensionality-reduction features, or even predictions from other models as features
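For the text case, a minimal sketch of “generate the corpus of words and make TF-IDF” with scikit-learn (the two documents are invented):

```python
# Build a TF-IDF matrix from raw text: one sparse row per document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)       # sparse (n_docs x n_words) TF-IDF matrix
print(X.shape, tfidf.get_feature_names_out())

# An interaction feature would be analogous (hypothetical column names):
# df["popular_and_liked"] = df["item_is_popular"] * df["customer_likes_item"]
```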
18. In the algorithm: Cross-Validation
● This basically means that from my main set I RANDOMLY create 2 sets. I build (train) my algorithm with the first one (let’s call it the training set) and score the other (let’s call it the validation set). I repeat this process multiple times and always check how my model performs on the validation set in respect to the metric I want to optimize.
● The process may look like:
1. For 10 (you choose how many X) times:
1. Split the set into training (50%-90% of the original data)
2. and validation (50%-10% of the original data)
3. Then fit the algorithm on the training set
4. Score the validation set
5. Save the result of that scoring in respect to the chosen metric
2. Calculate the average of these 10 (X) results. That is how much you should expect to score in real life, and it is generally a good estimate.
● Remember to use a SEED to be able to replicate these X splits (see the sketch below).
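The procedure above, sketched with scikit-learn (assumptions: synthetic data, logistic regression as a stand-in model, AUC as the chosen metric):

```python
# 10 seeded random train/validation splits; the mean validation AUC
# is the estimate of real-life performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=1000, random_state=1)

splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)  # the SEED
scores = []
for train_idx, valid_idx in splitter.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], preds))  # save each result

print(np.mean(scores))  # average over the X splits
```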
19. In Cross-Validation: Do feature selection
● It is unlikely that all features are useful, and some of them may be damaging your model, because…
– They do not provide new information (collinearity)
– They are just not predictive (just noise)
● There are many ways to do feature selection:
– Run the algorithm and seek an internal measure to retrieve the most important features (no cross-validation)
– Forward selection, with or without cross-validation
– Backward selection, with or without cross-validation
– Noise injection
– A hybrid of all methods
● Normally a forward method is chosen with CV. That is, we add a feature, then split the data into the exact same X splits as before and check whether our metric improved:
– If yes, the feature remains
– Else, Wiedersehen!
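A simplified single-pass sketch of forward selection with CV (data and model are assumed; a full version would re-scan all remaining features each round and add the best one):

```python
# Add each candidate feature in turn; keep it only if the cross-validated
# metric improves on the exact same seeded splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # same splits every time

selected, best = [], -np.inf
for j in range(X.shape[1]):
    candidate = selected + [j]
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, candidate], y, cv=cv, scoring="roc_auc").mean()
    if score > best:              # if yes, the feature remains
        selected, best = candidate, score
    # else: Wiedersehen!

print(selected, best)
```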
20. In Cross-Validation: Hyper Parameter Optimization
● This generally takes lots of time, depending on the algorithm. For example, in a random forest the best model would need to be determined based on parameters such as the number of trees, max depth, minimum cases in a split, features to consider at each split, etc.
● One way to find the best parameters is to manually change one (e.g. max_depth=10 instead of 9) while keeping everything else constant. I find this method helps you understand more about what would work in a specific dataset.
● Another way is to try many possible combinations of hyper-parameters. We normally do that with grid search, where we provide an array of all possible values to be trialled with cross-validation (e.g. try max_depth {8,9,10,11} and number of trees {90,120,200,500}), as in the sketch below.
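The grid-search example above, sketched with scikit-learn (synthetic data is assumed; the parameter values are the ones quoted on the slide):

```python
# Try all combinations of max_depth {8,9,10,11} and n_estimators
# {90,120,200,500} with cross-validation, and keep the best.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=1)

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"max_depth": [8, 9, 10, 11],
                                "n_estimators": [90, 120, 200, 500]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```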
21. Train Many Algorithms
● Try to exploit their strengths
● For example, focus on using linear features with linear regression (e.g. the higher the age, the higher the income) and non-linear ones with random forests
● Make it so that each model tries to capture something new, or even focuses on a different part of the data
22. Ensemble
● A key part (in winning competitions at least) is to combine the various models made.
● Remember, even a crappy model can be useful to some small extent.
● Possible ways to ensemble:
– Simple average: (model1 prediction + model2 prediction)/2
– Average ranks for AUC (simple average after converting to ranks)
– Manually tune weights with cross-validation
– Use a geometric-mean weighted average
– Use meta-modelling (also called stacked generalization, or stacking)
● Check GitHub for a complete example of these methods using the Amazon competition hosted by Kaggle: https://github.com/kaz-Anova/ensemble_amazon (top 60 rank).
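The three simplest options above in a short sketch (the two prediction vectors are made up for illustration):

```python
# Simple average, rank average (useful for AUC), and geometric mean.
import numpy as np
from scipy.stats import rankdata

p1 = np.array([0.9, 0.4, 0.7, 0.1])   # model 1 predictions
p2 = np.array([0.8, 0.5, 0.6, 0.2])   # model 2 predictions

simple_avg = (p1 + p2) / 2                    # (model1 + model2) / 2
rank_avg = (rankdata(p1) + rankdata(p2)) / 2  # average after converting to ranks
geomean = np.sqrt(p1 * p2)                    # geometric mean (equal weights)
```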
23. Different models I have experimented with, vol 1
● Logistic/linear/discriminant regression: fast, scalable, comprehensible, solid under high dimensionality, can be memory-light. Best when relationships are linear or all features are categorical. Good for text classification too.
● Random Forests: probably the best one-off overall algorithm out there (in my experience). Fast, scalable, memory-medium. Best when all features are numeric-continuous and there are strong non-linear relationships. Does not cope well with high dimensionality.
● Gradient Boosting (Trees): less memory-intense than forests (as individual predictors tend to be weaker). Fast, semi-scalable, memory-medium. Is good when forests are good.
● Neural Nets (a.k.a. deep learning): good for tasks humans are good at, like image recognition and sound recognition. Good with categorical variables too (as they replicate on-and-off signals). Medium speed, scalable, memory-light. Generally good for linear and non-linear tasks. May take a lot to train depending on structure. Many parameters to tune. Very prone to over- and under-fitting.
24. Different models I have experimented with, vol 2
● Support Vector Machines (SVMs): medium speed, not scalable, memory-intense. Still good at capturing linear and non-linear relationships. Holding the kernel matrix takes too much memory; not advisable for datasets bigger than 20k.
● K Nearest Neighbours: slow (depending on the size), not easily scalable, memory-heavy. Good when deciding between the good and the bad really comes down to how similar a case is to specific individuals. Also good when there are many target variables, as the similarity measures remain the same across different observations. Good for text classification too.
● Naïve Bayes: quick, scalable, memory-OK. Good for quick classifications on big datasets. Not particularly predictive.
● Factorization Machines: good gateways between linear and non-linear problems. They stand between regressions, KNNs and neural networks. Memory-medium, semi-scalable, medium speed. Good for predicting the rating a customer will assign to a product.
25. Tools vol 1
● Languages: Python, R, Java
● Liblinear for linear models: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
● LibSVM for Support Vector Machines: www.csie.ntu.edu.tw/~cjlin/libsvm/
● Scikit-learn package in Python for text classification, random forests and gradient boosting machines: scikit-learn.org/stable/
● Xgboost for fast, scalable gradient boosting: https://github.com/tqchen/xgboost
● LightGBM: https://github.com/Microsoft/LightGBM
● Vowpal Wabbit for fast, memory-efficient linear models: hunch.net/~vw/
● Encog for neural nets: http://www.heatonresearch.com/encog
● H2O in R for many models
26. Tools vol 2
● LibFM: www.libfm.org
● LibFFM: https://www.csie.ntu.edu.tw/~cjlin/libffm/
● Weka in Java (has everything): http://www.cs.waikato.ac.nz/ml/weka/
● GraphChi for factorizations: https://github.com/GraphChi
● GraphLab for lots of stuff: https://dato.com/products/create/open_source.html
● Cxxnet: one of the best implementations of convolutional neural nets out there. Difficult to install and requires a GPU with an NVIDIA graphics card: https://github.com/antinucleon/cxxnet
● RankLib: the best library out there made in Java for ranking algorithms (e.g. rank products for customers); supports optimization functions like NDCG: people.cs.umass.edu/~vdang/ranklib.html
● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne) for nets. These assume you have Theano (http://deeplearning.net/software/theano/) or TensorFlow (https://www.tensorflow.org/).
27. Where to go next to prepare for competitions
● Coursera: Andrew Ng’s class, https://www.coursera.org/course/ml
● Kaggle.com: many competitions for learning, for instance http://www.kaggle.com/c/titanic-gettingStarted (look for the “knowledge” flag)
● Very good slides from the University of Utah: www.cs.utah.edu/~piyush/teaching/cs5350.html
● clopinet.com/challenges/ : many past predictive-modelling competitions with tutorials
● Wikipedia: not to be underestimated; still the best source of information out there (collectively)