Dr. Pouria Amirian explains data science, the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in a data science project and solutions to them.
FrugalML: Using ML APIs More Accurately and Cheaply (Databricks)
Offering prediction APIs for a fee is a fast-growing industry and an important aspect of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data and budget. We take a first step towards addressing this challenge by proposing FrugalML, a principled framework that jointly learns the strengths and weaknesses of each API on different data, and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint. Our theoretical analysis shows that natural sparsity in the formulation can be leveraged to make FrugalML efficient. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Baidu and other providers for tasks including facial emotion recognition, sentiment analysis and speech recognition. Across various tasks, FrugalML can achieve up to 90% cost reduction while matching the accuracy of the best single API, or up to 5% better accuracy while matching the best API’s cost.
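The sequential strategy FrugalML searches for can be pictured as a cascade: try a cheap API first and fall back to a more expensive one only when the cheap prediction looks unreliable. The sketch below is not the FrugalML optimizer itself, just a minimal illustration of such a budget-aware cascade; the API names, prices, and the confidence threshold are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class PredictionAPI:
    name: str
    cost: float                                    # price per call (hypothetical)
    predict: Callable[[str], Tuple[str, float]]    # returns (label, confidence)

def cascade_predict(item, cheap_api, expensive_api, threshold=0.85):
    """Call the cheap API first; escalate only if its confidence is low.

    This mimics the shape of the sequential strategies FrugalML considers,
    except that FrugalML also learns the thresholds and chooses the APIs
    themselves under an explicit budget constraint.
    """
    label, confidence = cheap_api.predict(item)
    spent = cheap_api.cost
    if confidence >= threshold:
        return label, spent
    label, _ = expensive_api.predict(item)
    return label, spent + expensive_api.cost

# Toy usage with stubbed-out APIs (placeholders, not real services).
cheap = PredictionAPI("cheap-sentiment", 0.001, lambda x: ("positive", 0.70))
strong = PredictionAPI("strong-sentiment", 0.010, lambda x: ("negative", 0.95))
print(cascade_predict("the plot was thin but the acting was great", cheap, strong))
```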
Jeeves Grows Up: An AI Chatbot for Performance and Quality (Databricks)
Sarah: CEO-Finance-Report pipeline seems to be slow today. Why?
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days.
Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert.
We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:
Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021.
This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.
Importance of ML Reproducibility & Applications with MLflow (Databricks)
With data as a valuable currency and the architecture of reliable, scalable Data Lakes and Lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep up to realize value. Reproducibility, efficiency, and governance in training and production environments rest on both point-in-time snapshots of the data and a governing mechanism to regulate, track, and make the best use of the associated metadata.
This talk will outline the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions, as well as proposing solutions built on open-source technologies, namely Delta Lake for data versioning and MLflow for efficiency and governance.
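A minimal sketch of the kind of tracking the talk describes, using the public MLflow API; the experiment name, logged parameters, and the Delta path/version are illustrative placeholders rather than the presenters' actual setup.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("reproducible-training-demo")        # hypothetical experiment name
with mlflow.start_run():
    # Record which snapshot of the data was used, e.g. a Delta Lake version.
    mlflow.log_param("data_source", "/delta/events@v12")   # placeholder path/version
    mlflow.log_param("n_estimators", 200)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("test_mse", mse)
    mlflow.sklearn.log_model(model, "model")                # versioned artifact for governance
```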
Real-time Recommendations for Retail: Architecture, Algorithms, and Design (Juliet Hougland)
Users are constantly searching for new content, and to stay competitive, organizations must act immediately on up-to-date data. Outdated recommendations decrease the likelihood of presenting the right offer and make it harder to maintain customer loyalty. In order to provide the most relevant recommendations and increase engagement, organizations must track customer interactions and re-score recommendations on the fly.
Data sources have expanded dramatically to include a wealth of historical data and a constant influx of behavioral data. The key to moving from predictive models applied in batch to models that provide responses in real time is to focus on the efficiency of model application. The speed at which recommendations can be served is influenced by:
Architecture of the recommendation serving platform
Choice of recommendation algorithm
Datastore access patterns
In this presentation, we’ll discuss how developers can use open source components like HBase and Kiji to develop low-latency recommendation models that can be easily deployed by e-commerce companies. We will give practical advice on how to choose models and design data stores that make use of the architecture and quickly serve new recommendations.
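As a rough illustration of the serving-side access pattern (not the Kiji API itself), the sketch below reads precomputed recommendations for a user from an HBase table using happybase, a common Python HBase client; the table name, column family, and row-key layout are assumptions made for this example.

```python
import happybase

def get_recommendations(user_id: str, limit: int = 10):
    """Fetch precomputed recommendations for a user with a single row lookup.

    Assumes a table 'recommendations' whose row key is the user id and whose
    column family 'recs' stores item ids with their scores (layout is hypothetical).
    """
    connection = happybase.Connection("localhost")   # HBase Thrift server
    table = connection.table("recommendations")
    row = table.row(user_id.encode("utf-8"))
    # Sort candidate items by score and keep the top ones.
    scored = sorted(
        ((col.decode().split(":", 1)[1], float(val)) for col, val in row.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    connection.close()
    return scored[:limit]

# Example usage: print(get_recommendations("user-42"))
```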
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ... (Flavio Clesio)
Our presentation at Spark Summit EU 2017 - spark-summit.org/eu-2017/events/preventing-revenue-leakage-and-monitoring-distributed-systems-with-machine-learning/
This presentation, Programming for Data Science in Python, focuses on the importance of the Python programming language: it explains the language's characteristic features, its pros and cons, and its applications.
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi... (Formulatedby)
Presented by Hila Lamm, Chief Strategy Officer at Firefly.ai
Next DSS MIA Event - https://datascience.salon/miami/
Next DSS AUS Event - https://datascience.salon/austin/
With all the hype around automated machine learning for computer vision, businesses with structured data are left wondering: is AutoML relevant for enterprise data? Can it alleviate the bottleneck that data science teams are experiencing?
Our team experimented with different types of enterprise challenges -- from optimizing pricing to credit card fraud detection to retail banking customer behavior -- and was able to automatically build models that produced top-ranking Kaggle results within a few hours. In this session, through customer use cases and under-the-hood insights, you will learn about the capabilities of AutoML as applied on Firefly. Oh, and we’ll also talk about how we attained a Kaggle 1st place score in just half an hour.
How a global manufacturing company built a data science capability from scratch (Carlo Torniai)
In less than a year, Pirelli, a global manufacturing company best known for high-performance tires and motorsports, grew an impactful data science capability from the ground up. I am sharing a how-to guide for doing the same in your organization, equipping you with arguments to marshal and concrete tips to follow, while calling out pitfalls to watch out for along the way.
Practical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.ai (Sri Ambati)
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/vUqC8UPw9SU
Description:
The good news is that building fair, accountable, and transparent machine learning systems is possible. The bad news is that it’s harder than many blogs and software package docs would have you believe. The truth is that nearly all interpretable machine learning techniques generate approximate explanations, that the fields of eXplainable AI (XAI) and Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) are very new, and that few best practices have been widely agreed upon. This combination can lead to some ugly outcomes! This talk aims to make your interpretable machine learning project a success by describing fundamental technical challenges you will face in building an interpretable machine learning system, defining the real-world value proposition of approximate explanations for exact models, and then outlining the following viable techniques for debugging, explaining, and testing machine learning models:
* Model visualizations including decision tree surrogate models, individual conditional expectation (ICE) plots, partial dependence plots, and residual analysis.
* Reason code generation techniques like LIME, Shapley explanations, and Treeinterpreter.
* Sensitivity analysis.
Plenty of guidance on when, and when not, to use these techniques will also be shared, and the talk will conclude by providing guidelines for testing the generated explanations themselves for accuracy and stability. Open source examples (with lots of comments and helpful hints) for building interpretable machine learning systems are available to accompany the talk at: https://github.com/jphall663/interpretable_machine_learning_with_python
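To make one of the listed techniques concrete, here is a small generic example of partial dependence and ICE plots using scikit-learn (not taken from the linked repository); the dataset and the two features are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# kind="both" overlays individual conditional expectation (ICE) curves
# on top of the average partial dependence curve for each feature.
PartialDependenceDisplay.from_estimator(
    model, X, features=["MedInc", "AveOccup"], kind="both", subsample=50
)
plt.tight_layout()
plt.show()
```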
Speaker's Bio:
Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Prior to joining H2O.ai, Patrick held global customer-facing roles and R&D roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick was the 11th person worldwide to become a Cloudera certified data scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.
Building a Data Science as a Service Platform in Azure with Databricks (Databricks)
Machine learning in the enterprise is rarely delivered by a single team. In order to enable machine learning across an organisation you need to target a variety of different skills, processes, technologies, and maturities. Doing this is incredibly hard and requires a composite of different techniques to deliver a single platform that empowers all users to build and deploy machine learning models.
In this session we discuss how Azure & Databricks enable a Data Science as a Service platform. We look at how a DSaaS platform empowers users of all abilities to build and deploy models and enables organisations to realise a return on investment earlier.
Rsqrd AI: How to Design a Reliable and Reproducible Pipeline (Sanjana Chowdhury)
In this talk, David Aronchick, co-founder of Kubeflow and Microsoft's Head of Open Source ML, talks about designing reproducible and reliable ML pipelines. He speaks about the importance and impact of MLOps and use of metadata in pipelines. He also talks about a library he wrote to help with this problem, MLSpecLib.
**These slides are from a talk given at Rsqrd AI. Learn more at rsqrdai.org**
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/oxLZZMR1lVY
Description:
Driverless AI is H2O.ai's latest flagship product for automatic machine learning. It fully automates some of the most challenging and productive tasks in applied data science such as feature engineering, model tuning, model ensembling and model deployment. Driverless AI turns Kaggle-winning grandmaster recipes into production-ready code, and is specifically designed to avoid common mistakes such as under- or overfitting, data leakage or improper model validation, some of the hardest challenges in data science. Avoiding these pitfalls alone can save weeks or more for each model, and is necessary to achieve high modeling accuracy, especially for time-series problems.
With Driverless AI, data scientists of all proficiency levels can train and deploy modeling pipelines with just a few clicks from the GUI. Advanced users can use the client API from Python. Driverless AI builds hundreds or thousands of models under the hood to select the best feature engineering and modeling pipeline for every specific problem such as churn prediction, fraud detection, real-estate pricing, store sales prediction, marketing ad campaigns and many more.
To speed up training, Driverless AI uses highly optimized C++/CUDA algorithms to take full advantage of the latest compute hardware. For example, Driverless AI runs orders of magnitudes faster on the latest Nvidia GPU supercomputers on Intel and IBM platforms, both in the cloud or on premise. Driverless AI is fully supported on all major cloud providers.
There are two more product innovations in Driverless AI: statistically rigorous automatic data visualization and machine learning interpretability with reason codes and explanations in plain English. Both help data scientists and analysts to quickly validate the data and the models.
In this talk, we explain how Driverless AI works and show how easy it is to reach top 5% rankings for several highly competitive Kaggle competitions.
Speaker's Bio:
Arno Candel is the Chief Technology Officer at H2O.ai. He is the main committer of H2O-3 and Driverless AI and has been designing and implementing high-performance machine-learning algorithms since 2012. Previously, he spent a decade in supercomputing at ETH and SLAC and collaborated with CERN on next-generation particle accelerators. Arno holds a PhD and Masters summa cum laude in Physics from ETH Zurich, Switzerland. He was named “2014 Big Data All-Star” by Fortune Magazine and featured by ETH GLOBE in 2015. Follow him on Twitter: @ArnoCandel.
Applied Data Science Course Part 2: the data science workflow and basic model... (Dataiku)
In the second part of our applied machine learning online course, you'll get an overview of the different steps in the data science workflow as well as a deep dive into 3 basic types of models: linear, tree-based, and clustering.
Data & AI Platforms — Open Source Vs Managed Services (AWS vs Azure vs GCP) (Ankit Rathi)
While designing and building Data & AI platforms, you may need to evaluate the options available: whether your platform will be on-premise, use cloud services, or take a hybrid approach.
In any case, you may need to look at and evaluate various tools & services for your ingestion, storage, processing/analysis, and serving layers.
In this post, I have mapped open-source and popular managed cloud services to make our evaluation process a bit easier.
Machine Learning system architecture – Microsoft Translator, a Case Study : ... (Vishal Chowdhary)
Microsoft Translator currently supports 100+ languages. We constantly improve the translation quality, add new scenarios, all with a constant team size. This session describes a production scale machine learning architecture using MS Translator as a case study. You will learn the mental model to approach your ML problem and concrete Do’s and Don’ts for the various components of the ML system architecture.
Scaling AutoML-Driven Anomaly Detection With Luminaire (Databricks)
Organizations rely heavily on time series metrics to measure and model key aspects of operational and business performance. The ability to reliably detect issues with these metrics is imperative to identifying early indicators of major problems before they become pervasive. This is a difficult machine learning and systems problem because temporal patterns are complex, ever changing, and often very noisy, traditionally requiring significant manual configuration and model maintenance.
At Zillow, we have built an orchestration framework around Luminaire, our open-source python library for hands-off time-series Anomaly Detection. Luminaire provides a suite of models and built-in AutoML capabilities which we process with Spark for distributed training and scoring of thousands of metrics. In this talk, we will cover the architecture of this framework and performance of the Luminaire package across detection and prediction accuracy as well as runtime efficiency.
MLOps: From Data Science to Business ROI
This deck describes why operationalizing ML (running ML and DL in production and managing the full production lifecycle) is challenging. We also describe MCenter and how it manages the ML lifecycle
Building predictive models in Azure Machine Learning (Mostafa)
This presentation covers how to build machine learning models and derive insights from data. The session shows how to develop and train models in Python/R using Azure Machine Learning, explores key concepts in data acquisition, preparation, exploration, and visualization, and takes a look at how to build a predictive solution using Azure Machine Learning, R, and Python. It also offers tips and tricks on selecting the right algorithm for your data science problem and on using machine learning to solve it.
DN18 | Applied Machine Learning in Cybersecurity: Detect malicious DGA Domain... (Dataconomy Media)
Abstract of the Presentation:
Malware like the GameOver Zeus and CryptoLocker botnets is a massive threat for organizations. It uses domain generation algorithms (DGAs) to create URLs that host malicious websites or command-and-control servers. Traditional approaches fail to detect and stop them early. In this talk you will learn, through a live demo, how to use machine learning to detect malicious domains in your environment and how to implement a full end-to-end data science use case leveraging the Splunk Machine Learning Toolkit.
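The Splunk Machine Learning Toolkit workflow itself is shown live in the talk; as a generic, self-contained illustration of the underlying idea, the sketch below trains a character n-gram classifier to separate DGA-like domains from benign ones. The domain strings and labels are tiny fabricated placeholders for illustration only; real training data would come from DGA feeds and known-good domain lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative sample (placeholder strings, not real threat data).
domains = ["google.com", "wikipedia.org", "github.com", "bbc.co.uk",
           "xjw9qpl2mzkd.net", "qzv8r1ttyhna.biz", "plqo29dkfjwe.info", "a8d7f6g5h4j3.com"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]   # 0 = benign, 1 = DGA-like

# Character n-grams capture the "random-looking" structure of DGA domains.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(domains, labels)

# Probability that each new domain is DGA-like.
print(model.predict_proba(["microsoft.com", "kq2x9vz7rplm.org"])[:, 1])
```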
About the Author:
Philipp works as Staff Machine Learning Architect at Splunk. His background is in data science, visualization, and analytics, with experience in the automotive, transportation, and software industries. He enjoys working with Splunk customers and partners across EMEA.
"Automated machine learning (AutoML) is the process of automating the end-to-end process of applying machine learning to real-world problems. In a typical machine learning application, practitioners must apply the appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods that make the dataset amenable for machine learning. Following those preprocessing steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their final machine learning model. As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. Automating the end-to-end process of applying machine learning offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform models that were designed by hand."
In this talk we will discuss how QuSandbox and the Model Analytics Studio can be used in the selection of machine learning models. We will also illustrate AutoML frameworks through demos and examples and show you how to get started.
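QuSandbox and Firefly are commercial platforms, so as a stand-in the sketch below uses TPOT, an open-source AutoML library, to show the general shape of an AutoML run on structured data; the dataset and search parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# TPOT searches over preprocessing steps, models, and hyperparameters.
automl = TPOTClassifier(generations=5, population_size=20, random_state=0, verbosity=2)
automl.fit(X_train, y_train)

print("hold-out accuracy:", automl.score(X_test, y_test))
automl.export("best_pipeline.py")   # writes the winning pipeline as plain scikit-learn code
```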
Brief introduction to Cerved data, the role of data scientist in Cerved and how a data scientist can take advantage from graph database.
Bio:
Stefano Gatti: Born in 1970, he has been involved for more than 15 years in several big-data and technology-driven projects at leading business information companies like Lince and Cerved. He is very fond of agile methodologies and tries to apply them at all organizational levels. In recent years he has been strongly engaged in facilitating the spread of innovation at Cerved and in taking advantage of new big and smart data technologies, especially from a business usage perspective. Datatelling, open innovation, and partnerships with smart actors of the worldwide data-driven innovation ecosystem are his current mantras. Nunzio Pellegrino: Data Scientist at Cerved, part of the Innovation team, with a focus on extracting value from data and solving problems with the latest technologies available. I have a degree in Statistics with a background in Machine Learning, and I worked primarily on Data Integration and Business Intelligence projects for 3 years. At the moment, I am product owner of a web application based on GraphDB and am involved in Italian Open Data projects. I am an R enthusiast, a Python practitioner, and fascinated by the graph ecosystem.
Explains: What is Data Science? What is the difference between Data Science and Data Engineering, and between Data Science and Business Intelligence? What type of work do Data Scientists do, and what types of companies employ them? What is the job outlook for Data Science? What professional education is required?
Disruptive Data Science - How Data Science and Big Data are Transforming Busi... (EMC)
An examination of the trends of Big Data and Advanced Analytics as well as the technology, services and education needed to thrive in this new field. This session explores examples of true industry-disruptive analytics-driven transformations and the catalysts for transformation. Examining the role of people is paramount to success in order to develop a high-performing data scientist team - starting today.
This session describes the roles and skill sets required when building a Data Science team, and starting a data science initiative, including how to develop Data Science capabilities, select suitable organizational models for Data Science teams, and understand the role of executive engagement for enhancing analytical maturity at an organization.
After this session you will be able to:
Objective 1: Understand the knowledge and skills needed for a Data Science team and how to acquire them.
Objective 2: Learn about the different organizational models for forming a Data Science team and how to choose the best for your organization.
Objective 3: Understand the importance of executive support for Data Science initiatives and the role it plays in their successful deployment.
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc... (Denodo)
Watch full webinar here: https://bit.ly/3offv7G
Presented at AI Live APAC
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Watch this on-demand session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercise
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc.
Advanced Analytics and Machine Learning with Data Virtualization (Denodo)
Watch: https://bit.ly/2DYsUhD
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- How Prologis accelerated their use of Machine Learning with data virtualization
How Data Virtualization Puts Machine Learning into Production (APAC) (Denodo)
Watch full webinar here: https://bit.ly/3mJJ4w9
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercise
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc
Crossing the Analytics Chasm and Getting the Models You Developed Deployed (Robert Grossman)
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
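As one concrete way to cross that chasm, a trained model can be exported to PMML so the operational side can score it without the training stack. The sketch below uses the sklearn2pmml package, which is only one of several ways to produce PMML and is chosen here purely for illustration; the dataset and file name are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Wrap the estimator in a PMMLPipeline so it can be serialized to PMML.
pipeline = PMMLPipeline([("classifier", LogisticRegression(max_iter=1000))])
pipeline.fit(X, y)

# Writes an XML document that a PMML scoring engine can execute
# in production without the Python training environment.
sklearn2pmml(pipeline, "iris_logreg.pmml")
```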
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15 (MLconf)
Incorporating the Real Time Component into Analytics and Machine Learning: Many industries and organizations today want to harness the power of big data analytics and machine learning for its potential to improve margins, enhance discoveries, give insight into the business, and enable fast data driven decisions. The challenges include inability and/or difficulties in using available systems, not knowing where to start or which tools make sense for a particular problem, and dealing with data sets that are too big, too fast, or too complicated to handle with traditional systems.
RTDS Inc. has developed SymetryML™, a set of technologies for zero-latency machine learning and analytics/exploration of very large datasets in real time, with a focus on speed, accuracy and simplicity. Our goal has been to cut the memory footprint required to learn large data sets, provide "reducer" functionality to automatically select the best attributes for model creation, and build models on the fly. SymetryML™ is also designed for easy integration into existing business processes via either an easy-to-use Web UI or RESTful APIs.
This talk will explore some of the functionality of these systems including real time exploration of data, fast multi-variate model prototyping, and our use of GPUs and parallelization. An example of brain related data and the complexities of analytics will be discussed as well as a brief overview of other verticals we are exploring. Our work is geared towards making big data make sense in real time and enable users to gain insights faster than traditional methods.
Advanced Analytics and Machine Learning with Data Virtualization (Denodo)
Watch full webinar here: https://bit.ly/32c6TnG
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
Why Your Data Science Architecture Should Include a Data Virtualization Tool ... (Denodo)
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of the data scientists.
However, most architecture laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative to address these issues and can accelerate data acquisition and massaging. And a customer story on the use of Machine Learning with data virtualization.
Predictive analytics is touching more and more lives every day. Machine learning lets you predict and change the future. Did you know that Microsoft products like Xbox and Bing integrate machine learning capabilities into their workflows? Come to the session and take a look at the new cloud-based machine learning platform called AzureML from a BI architect's perspective, without all the data scientist knowledge.
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat... (Denodo)
Watch full webinar here: https://bit.ly/3xj6fnm
Presented at Chief Data Officer Live 2021 A/NZ
The world is changing faster than ever, and for companies to compete and succeed they need to be agile in order to respond quickly to market changes and emerging opportunities. Data plays an integral role in achieving this business agility. However, given the complex nature of the enterprise data architecture, finding and analysing data is an increasingly challenging task. Data virtualization is a modern data integration technique that integrates data in real time, without having to physically replicate it.
Watch on-demand this session to understand what data virtualization is and how it:
- Delivers data in real-time, and without replication
- Creates a logical architecture to provide a single view of truth
- Centralises the data governance and security framework
- Democratises data for faster decision making and business agility
ATMOSPHERE was invited to speak at the Think Milano event on 6th June, from 14.30 to 17.30, joining a panel discussion called “L’infrastruttura cloud ready protagonista del futuro” (“The cloud-ready infrastructure as protagonist of the future”) on how cloud infrastructures matter for different market sectors.
Data Analytics in your IoT Solution - Fukiat Julnual, Technical Evangelist, Mic... (BAINIDA)
Data Analytics in your IoT Solution, by Fukiat Julnual, Technical Evangelist, Microsoft (Thailand) Limited, at THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE, organized by the School of Applied Statistics and DATA SCIENCES THAILAND.
Similar to Data Science as a Service: Intersection of Cloud Computing and Data Science (20)
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Science as a Service: Intersection of Cloud Computing and Data Science
1. Data Science as a Service
Dr. Pouria Amirian (Pouria.Amirian@ndm.ox.ac.uk)
Big Data Project Coordinator, The Global Health Network, University of Oxford
Intersection Of Cloud Computing And Data Science
2. outline
Data Science
What data science is
Steps in a Data Science project
Experiments
Using AzureML
Big Data issues
In a data science project
Methods in analysis
3. What is Data Science?
Practice of obtaining useful insights from data
3 Vs of Big Data:
Volume
Variety
Velocity
+ other Vs
It applies to large-volume data (volume)
It applies to semi-structured and unstructured data (variety)
It sometimes applies to real-time or fast-changing data (velocity)
It applies to small and traditional static data
4. Data Science as a team sport
Math
Statistical Learning
Linguistics
Machine Learning
Signal Processing
Programming
Storage/Data Structure
Operations Research
Distributed and High Performance Computing
5. Data Science from analytics point of view
Analytics Spectrum:
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What should I do?
6. Data Science Vs. Business Intelligence
Analytics Spectrum:
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What should I do?
Traditional BI covers the descriptive and diagnostic end of this spectrum.
7. Why is it so popular? Why does it matter?
A) More Available and Usable Data
McKinsey: organizations that use data science to make decisions are more productive and deliver higher ROI.
Gartner: organizations that invest in modern data infrastructure will outperform their peers by up to 20%.
8. Why is it so popular? Why does it matter?
B) Increased Awareness of Machine Learning Techniques
A subset of machine learning algorithms is now more widely understood, since they have been tried and tested by early adopters such as Netflix and Amazon (recommendation engines). While many people may not know the details of the algorithms used, they increasingly understand their research/business value.
9. Why is it so popular? Why does it matter?
C) More Accurate Analysis
The large volumes of data being collected also enable you to build more accurate predictive models.
The larger the sample size, the smaller the margin of error. This in turn increases the accuracy of predictions from your model.
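A quick numeric illustration of the sample-size point (a standard textbook formula, not from the slides): the margin of error of a sample mean shrinks roughly as 1/sqrt(n), so a hundredfold increase in data cuts the margin by a factor of ten. The standard deviation below is an assumed, illustrative value.

```python
import math

sigma = 10.0          # assumed population standard deviation (illustrative)
for n in (100, 10_000, 1_000_000):
    margin = 1.96 * sigma / math.sqrt(n)   # 95% confidence margin of error
    print(f"n = {n:>9,}  margin of error = {margin:.3f}")
```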
10. Why is it so popular? Why does it matter?
D) Faster and Cheaper Computation
Today, a smartphone’s processor is up to five times more powerful than that of a desktop computer 20 years ago.
The price of computation has decreased while its capacity has increased, yielding dramatic gains in technology, productivity, innovation, etc.
11. The Data Science Workflow
Problem Definition
Data Collection and Preparation
Model Development
Model Deployment
Performance Improvement
(The workflow diagram annotates the steps as: Critical, Very Important, Time Consuming, Fun :D, Iterative, Cumbersome :(, Critical.)
12. The Data Science Workflow
Problem Definition:
• Domain Knowledge
• Separation of Concerns
• Prioritize each problem
Data Collection and Preparation:
• Selection of the right data
• Data Transformation
• Missing Values
• Exploratory analysis
Model Development:
• Right algorithm
• Test accuracy
• Test other algorithms
• Validate
Model Deployment:
• Turning the data scientist’s model into developer code (R to C#)
Performance Improvement:
• Monitor the performance of the deployed model
• Re-training the model
• Re-deploying the model
• Re-monitoring
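A minimal sketch of the data collection and preparation step in Python; pandas and scikit-learn here stand in for the equivalent AzureML Studio modules, and the file name and column names are placeholders loosely based on the automobile-price example used later in the deck.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load raw data (placeholder file and columns).
df = pd.read_csv("automobiles.csv")

# Select the right data: keep only the columns the problem needs.
df = df[["make", "engine-size", "horsepower", "highway-mpg", "price"]]

# Handle missing values: drop rows without a target, impute numeric features.
df = df.dropna(subset=["price"])
for col in ["engine-size", "horsepower", "highway-mpg"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())

# Simple transformation: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["make"])

# Exploratory check and a train/test split for the modelling step.
print(df.describe())
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="price"), df["price"], test_size=0.25, random_state=0
)
```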
13. The Data Science Workflow
Big Data Issues (I)
Problem Definition
Data Collection and Preparation
Model Development
Model Deployment
Performance Improvement
14. Solutions to overcome the big data issues
1- Use advanced research computing (http://www.arc.ox.ac.uk/)
15. Solutions to overcome the big data issues
2- Create and use a Hadoop cluster
Open source (Apache)
It is based on two components:
HDFS
MapReduce
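To make the MapReduce model concrete, here is the classic word-count example written as Hadoop Streaming scripts in Python (a standard textbook illustration, not taken from the talk): the mapper emits (word, 1) pairs, Hadoop sorts them by key, and the reducer sums the counts per word, with HDFS holding the input and output files.

```python
#!/usr/bin/env python
# mapper.py -- reads lines from stdin and emits "word<TAB>1" for every word.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# reducer.py -- Hadoop sorts mapper output by key, so counts for the same
# word arrive consecutively and can be summed with a running total.
def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # In a real job, mapper() and reducer() live in separate scripts passed to
    # hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
    mapper()
```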
22. AzureML (Azure Machine Learning)
Azure ML provides an easy-to-use and powerful set of cloud-based data transformation and machine learning tools.
AzureML Studio (or Studio for short)
It has many modules for data transformation, analysis, visualization, …
It supports R and Python
It is under heavy development
www.studio.azureml.net
23. AzureML Workflow
Data Input
Data Transformation (Project)
Split Data (training and test)
Learning Algorithm
Train the Learning Algorithm
Validate the Algorithm (Score)
Evaluate Model Performance
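The same flow expressed outside the Studio GUI, as a hedged scikit-learn sketch: each commented step maps onto one of the Studio modules listed above, while the dataset and algorithm are arbitrary choices made only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data Input
X, y = load_diabetes(return_X_y=True)

# Split Data (training and test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Data Transformation (Project): fit the scaler on training data only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Learning Algorithm + Train the Learning Algorithm
model = LinearRegression().fit(X_train, y_train)

# Validate the Algorithm (Score)
predictions = model.predict(X_test)

# Evaluate Model Performance
print("MAE:", mean_absolute_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
```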
27. Fourth experiment: Creating Web service
Very easy, just a few clicks!
Make: bmw
Engine-size: 164
Horsepower: 121
Highway-mpg: 25
Its actual price is 24,565
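Consuming such a web service from Python looks roughly like the sketch below. The endpoint URL, API key, and JSON payload layout are placeholders: Azure ML Studio generates the exact request/response code for a published service, so treat this only as the general shape of the call for the example car above.

```python
import json
import urllib.request

URL = "https://example-region.services.azureml.net/.../score"   # placeholder endpoint
API_KEY = "<your-api-key>"                                       # placeholder key

# One row of features matching the experiment's input schema.
payload = {
    "make": "bmw",
    "engine-size": 164,
    "horsepower": 121,
    "highway-mpg": 25,
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))   # predicted price; the actual price was 24,565
```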
28. Tips
Data input can come from a variety of data interfaces, including HTTP connections (any file-sharing service like Dropbox, Google Drive, OneDrive), SQL Azure, and Hive Query.
You can use functionality from all supported R packages (410).
You can write your own utility functions and upload them as another module.
It is under heavy development:
Two weeks ago the process for web service publication changed
Two months ago there was no support for Python
Two months ago around 400 R packages were supported
…
29. Big Data Issues (II)
High-dimensional data, or wide data
Using various methods requires knowledge of those methods
Traditional methods are not efficient enough (unstable), least squares for example
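A small illustration of why plain least squares struggles on wide data and how a regularized method stabilizes it; the data is synthetic, generated only to show the effect, and ridge regression stands in here for the many possible remedies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Wide data: far more features (500) than observations (80).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 500))
true_coef = np.zeros(500)
true_coef[:10] = 3.0                                  # only a few features actually matter
y = X @ true_coef + rng.normal(scale=1.0, size=80)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

ols = LinearRegression().fit(X_train, y_train)        # unstable when features >> observations
ridge = Ridge(alpha=10.0).fit(X_train, y_train)       # shrinkage stabilizes the fit

print("OLS   test R^2:", round(r2_score(y_test, ols.predict(X_test)), 3))
print("Ridge test R^2:", round(r2_score(y_test, ridge.predict(X_test)), 3))
```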
30. Advantages of AzureML
Solutions can be quickly deployed as web services.
Models run in a highly scalable cloud environment.
You can use the R and Python languages for solution-specific functionality.
It generates minimal code for consuming the web service in R and Python (and C#).
It can be run from anywhere.
31. “Big Data is not about Data. The value in big data is in Analytics.” (Gary King)
Thanks for your attention
Time for Q/A