A lack of trust is inhibiting the adoption of #AI. This presentation discusses approaches to delivering trusted data pipelines for AI and machine learning
This talk was presented as a keynote by Mrs. Dorothea Wisemann, Department Head of Cognitive Computing & Industry Solutions at IBM Research - Zurich, during Data Science Conference 4.0.
More info about Data Science Conference:
Website: http://datasciconference.com
Instagram: https://www.instagram.com/datasciconf/
Facebook: https://www.facebook.com/DataSciConference/
Twitter: https://twitter.com/datasciconf
Flickr: https://www.flickr.com/photos/data-science-conference
Explainable AI (XAI) is becoming a must-have non-functional requirement (NFR) for most AI-enabled product or solution deployments. Keen to hear viewpoints and explore collaboration opportunities.
An introductory presentation to Explainable AI, defending its main motivations and importance. We briefly describe the main techniques available as of March 2020 and share many references to allow readers to continue their studies.
Explainable AI makes algorithms transparent, supporting interpretation, visualization, explanation, and integration for fair, secure, and trustworthy AI applications.
Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno... (Analytics India Magazine)
Most organizations understand the predictive power and the potential gains from AIML, but AI and ML are still a black box technology for them. While deep learning and neural networks can provide excellent inputs to businesses, leaders are challenged to use them because of the complete blind faith required to 'trust' AI. In this talk we will use the latest technological developments from researchers, the US Department of Defense, and industry to unbox the black box and give businesses a clear understanding of the policy levers they can pull, why, and by how much, to make effective decisions.
Get hands-on with Explainable AI at the Machine Learning Interpretability (MLI) Gym! (Sri Ambati)
This meetup took place in Mountain View on January 24th, 2019.
Description:
With effort and contributions from researchers and practitioners in academia and industry, Machine Learning Interpretation has become a young sub-field of ML. The norms around its definition and understanding are still in their infancy, and numerous different approaches are emerging rapidly. However, there seems to be a lack of a consistent explanation framework to evaluate and consistently benchmark different algorithms against the interpretation, completeness, and consistency of their explanations.
The idea of the gym is to provide a controlled interactive environment for all forms of machine learning algorithms - initially focusing on supervised predictive modeling problems - to allow analysts and data scientists to explore, debug, and generate insightful understanding of models by:
1. Model Validation: ways to explore and validate black-box ML systems, enabling model comparison both globally and locally, and identifying biases in the training data through interpretation.
2. What-If Analysis: an interactive environment where communication can happen, i.e. learning through interaction, with the user able to conduct "what-if" analysis on the effect of single or multiple features and their interactions.
3. Model Debugging: ways to analyze model misbehavior by exploring counterfactual examples (adversarial examples and training).
4. Interpretable Models: the ability to build natively interpretable models, with the goal of simplifying complex models to enable better understanding.
The central concept of the MLI Gym is an interactive environment where one can explore and simulate variations in the world (after a model is operationalized) beyond point estimates of the usual model metrics, e.g. ROC-AUC, confusion matrix, RMSE, and R² score.
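The what-if analysis described above can be sketched in a few lines: vary one feature of a single instance and record how the prediction responds. The scoring function and feature names below are hypothetical stand-ins, not part of any real MLI Gym API.

```python
# Hypothetical scoring function standing in for a trained model.
def credit_model(income, debt_ratio):
    return max(0.0, min(1.0, 0.4 + 0.01 * (income / 1000) - 0.6 * debt_ratio))

def what_if(model, instance, feature, values):
    """Return (value, prediction) pairs with one feature swapped at a time."""
    out = []
    for v in values:
        probe = dict(instance)   # copy so the original instance is untouched
        probe[feature] = v
        out.append((v, model(**probe)))
    return out

applicant = {"income": 40000, "debt_ratio": 0.5}
curve = what_if(credit_model, applicant, "debt_ratio", [0.1, 0.3, 0.5, 0.7])
```

Plotting `curve` shows how the score degrades as the probed feature grows, which is exactly the kind of single-feature interaction the gym aims to make explorable.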
Speaker's Bio:
Pramit is a Lead Data Scientist at H2O.ai. His areas of interest include building statistical/machine learning models (Bayesian and frequentist modeling techniques) to help businesses realize their data-driven goals.
Currently, he is exploring "Model Interpretation" as a means to efficiently understand the true nature of predictive models and enable model robustness and security. He believes effective model inference coupled with adversarial training could lead to trustworthy models with known blind spots. He has started an open-source project, Skater (https://github.com/datascienceinc/Skater), to address the need for model inference. The project is still in its early stages of development, but check it out; he is always eager for feedback.
Spark 2019: Equifax's SVP Data & Analytics, Peter Maynard, discusses the notion (and importance) of explainable AI in the financial services sector. He looks at the work Equifax has done to crack open the black box by creating patented AI technology that helps companies make smarter, explainable decisions using AI.
This tutorial extensively covers the definitions, nuances, challenges, and requirements for the design of interpretable and explainable machine learning models and systems in healthcare. We discuss many uses in which interpretable machine learning models are needed in healthcare and how they should be deployed. Additionally, we explore the landscape of recent advances that address the challenges of model interpretability in healthcare, and describe how one would choose the right interpretable machine learning algorithm for a given problem in healthcare.
Practical Explainable AI: How to build trustworthy, transparent and unbiased ... (Raheel Ahmad)
This presentation is from the Federated & Distributed Machine Learning Conference. The talk focuses on why we need explainable AI and how we can build models that are trustworthy, transparent, and unbiased.
This was presented at the London Artificial Intelligence & Deep Learning Meetup.
https://www.meetup.com/London-Artificial-Intelligence-Deep-Learning/events/245251725/
Enjoy the recording: https://youtu.be/CY3t11vuuOM.
- - -
Kasia discussed complexities of interpreting black-box algorithms and how these may affect some industries. She presented the most popular methods of interpreting Machine Learning classifiers, for example, feature importance or partial dependence plots and Bayesian networks. Finally, she introduced the Local Interpretable Model-Agnostic Explanations (LIME) framework for explaining predictions of black-box learners, including text- and image-based models, using breast cancer data as a specific case scenario.
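The core idea behind LIME can be sketched without the lime library itself: perturb around one instance, weight the perturbations by proximity, and fit a weighted linear surrogate whose coefficients serve as the local explanation. The `black_box` model and all names below are illustrative assumptions, not the framework's actual API.

```python
import numpy as np

def black_box(X):
    # Hypothetical opaque model: nonlinear in feature 0, linear in feature 1.
    return np.sin(X[:, 0]) + 2.0 * X[:, 1]

def lime_style_explain(predict, x, n_samples=2000, width=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Sample perturbations around the instance being explained.
    Z = x + rng.normal(scale=width, size=(n_samples, len(x)))
    y = predict(Z)
    # Proximity kernel: nearby perturbations count more.
    d2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * width ** 2))
    # Weighted least squares for an interpretable linear surrogate.
    A = np.hstack([Z, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * A, np.sqrt(w) * y, rcond=None)
    return coef[:-1]  # per-feature local attributions (intercept dropped)

x0 = np.array([0.0, 1.0])
weights = lime_style_explain(black_box, x0)
```

Near `x0` the model behaves like `cos(0)*z0 + 2*z1`, so the surrogate's coefficients come out close to the local gradient, which is what makes the explanation faithful only locally.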
Kasia Kulma is a Data Scientist at Aviva with a soft spot for R. She obtained a PhD (Uppsala University, Sweden) in evolutionary biology in 2013 and has been working on all things data ever since. For example, she has built recommender systems, customer segmentations, and predictive models, and now she is leading an NLP project at the UK's leading insurer. In her spare time she tries to relax by hiking and camping, but if that doesn't work ;) she co-organizes R-Ladies meetups and writes a data science blog, R-tastic (https://kkulma.github.io/).
https://www.linkedin.com/in/kasia-kulma-phd-7695b923/
Artificial Intelligence is increasingly playing an integral role in determining our day-to-day experiences. Moreover, with the proliferation of AI-based solutions in areas such as hiring, lending, criminal justice, healthcare, and education, the resulting personal and professional implications of AI are far-reaching. The dominant role played by AI models in these domains has led to a growing concern regarding potential bias in these models, and a demand for model transparency and interpretability. In addition, model explainability is a prerequisite for building trust and adoption of AI systems in high-stakes domains requiring reliability and safety, such as healthcare and automated transportation, and critical industrial applications with significant economic implications, such as predictive maintenance, exploration of natural resources, and climate change modeling.
As a consequence, AI researchers and practitioners have focused their attention on explainable AI to help them better trust and understand models at scale. The challenges for the research community include (i) defining model explainability, (ii) formulating explainability tasks for understanding model behavior and developing solutions for these tasks, and finally (iii) designing measures for evaluating the performance of models in explainability tasks.
In this tutorial, we present an overview of model interpretability and explainability in AI, key regulations / laws, and techniques / tools for providing explainability as part of AI/ML systems. Then, we focus on the application of explainability techniques in industry, wherein we present practical challenges / guidelines for effectively using explainability techniques and lessons learned from deploying explainable models for several web-scale machine learning and data mining applications. We present case studies across different companies, spanning application domains such as search & recommendation systems, sales, lending, and fraud detection. Finally, based on our experiences in industry, we identify open problems and research directions for the data mining / machine learning community.
The document discusses explainability and bias in machine learning/AI models. It covers several topics:
1. Why explainability of models is important, including for laypeople using models and potential legal needs for explanations of decisions.
2. Methods for explainability, including directly interpretable models and post-hoc explainability methods like LIME and SHAP, which provide feature attributions.
3. Issues with bias in machine learning models and different definitions of fairness. It also discusses techniques for measuring and mitigating bias, such as reweighting data or using adversarial learning.
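The reweighting technique mentioned above can be sketched as follows. Each (group, label) cell gets the weight expected_frequency / observed_frequency, so that after weighting, group membership and the outcome are statistically independent. The applicant data below is hypothetical.

```python
from collections import Counter

def reweigh(groups, labels):
    """Per-instance weights that decorrelate group membership from the label."""
    n = len(labels)
    g_count = Counter(groups)
    y_count = Counter(labels)
    gy_count = Counter(zip(groups, labels))
    weights = []
    for g, y in zip(groups, labels):
        expected = (g_count[g] / n) * (y_count[y] / n)  # under independence
        observed = gy_count[(g, y)] / n                 # in the actual data
        weights.append(expected / observed)
    return weights

# Hypothetical protected attribute and binary outcome for eight applicants:
# group "a" gets the positive label three times as often as group "b".
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
w = reweigh(groups, labels)
```

Over-represented cells like ("a", 1) receive weights below 1 and under-represented cells like ("b", 1) receive weights above 1, so a learner trained with these sample weights sees a balanced picture.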
The document discusses predictive analytics techniques including data preparation, modeling, and model monitoring. It describes preparing data through transformation, deriving behavioral variables, and quality checks. Modeling techniques covered include decision trees, regression, neural networks, and ensemble modeling in SAS Enterprise Miner or other software. Model monitoring compares actual and predicted values, analyzes variable distributions in scored data, and monitors model performance metrics.
This is an introduction to text analytics for advanced business users and IT professionals with limited programming expertise. The presentation will go through different areas of text analytics as well as provide some real-world examples that help make the subject matter a little more relatable. We will cover topics like search engine building, categorization (supervised and unsupervised), clustering, NLP, and social media analysis.
This document provides an overview of AlgoAnalytics, an analytics consultancy company that uses advanced machine learning techniques. The summary is as follows:
(1) AlgoAnalytics provides predictive analytics solutions for retail, healthcare, financial services, and other industries using techniques like deep learning, natural language processing, and computer vision on structured, text, image and sound data.
(2) The CEO and founder, Aniruddha Pant, has over 20 years of experience applying mathematical techniques to business problems. Some of AlgoAnalytics' work includes recommender systems, demand prediction, image analysis, and customer churn prevention for online retail.
(3) Examples of AlgoAnalytics' predictive models shown include an
Text analytics is used to extract structured data from unstructured text sources like social media posts, reviews, emails and call center notes. It involves acquiring and preparing text data, processing and analyzing it using algorithms like decision trees, naive bayes, support vector machines and k-nearest neighbors to extract terms, entities, concepts and sentiment. The results are then visualized to support data-driven decision making for applications like measuring customer opinions and providing search capabilities. Popular tools for text analytics include RapidMiner, KNIME, SPSS and R.
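As a small illustration of one of the algorithms listed, here is a minimal naive Bayes text classifier over a hypothetical toy corpus. Real pipelines would add tokenization, stemming, and TF-IDF weighting on top of this.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled snippets standing in for reviews or call-center notes.
train = [
    ("great product love it", "pos"),
    ("excellent service very happy", "pos"),
    ("terrible quality waste of money", "neg"),
    ("awful support very unhappy", "neg"),
]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: sum(1 for _, l in train if l == c) / len(train) for c in class_docs}
counts = {c: Counter(ws) for c, ws in class_docs.items()}

def classify(text):
    """Log-space naive Bayes with add-one (Laplace) smoothing."""
    scores = {}
    for c in class_docs:
        total = sum(counts[c].values())
        score = math.log(priors[c])
        for w in text.split():
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("love the excellent quality"))  # prints "pos"
```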
Data Science - Part I - Sustaining Predictive Analytics Capabilities (Derek Kane)
This is the first lecture in a series on data analytics topics, geared to individuals and business professionals who have no background in building modern analytics approaches. This lecture provides an overview of the models and techniques we will address throughout the series; we will discuss business intelligence topics, predictive analytics, and big data technologies. Finally, we will walk through a simple yet effective example that showcases the potential of predictive analytics in a business context.
How ML can improve purchase conversions (Sudeep Shukla)
- What is Machine Learning and what problems can it solve?
- Basic Machine Learning models
- Data gathering and data cleaning
- Parameters for judging whether the model is performing well
- Making it easy for sales & marketing teams to use the ML program
As AI becomes more and more prevalent in our lives, the decisions it makes for us are becoming more and more impactful on our lives and those of others.
How can we help people trust the models we're building? The field of Explainable AI focuses on making any machine learning model interpretable by non-experts.
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYC (Sri Ambati)
This talk was recorded in NYC on October 22nd, 2019 and can be viewed here: https://youtu.be/ZGSEDv8hqHY
The Case for Model Debugging
Prediction by machine learning models is fundamentally the execution of computer code. Like all good code, machine learning models should be debugged for logical or runtime errors or for security vulnerabilities. Recent, high-profile failures have made it clear that machine learning models must also be debugged for disparate impact across demographic segments and other types of sociological bias. Model debugging enhances trust in machine learning directly by increasing accuracy in new or holdout data, by decreasing or identifying hackable attack surfaces, or by decreasing sociological bias. As a side-effect, model debugging should also increase understanding and explainability of model mechanisms and predictions. This presentation outlines several standard and newer model debugging techniques and proposes several potential remediation methods for any discovered bugs. Discussed debugging techniques include adversarial examples, benchmark models, partial dependence and individual conditional expectation, random attacks, Shapley explanations of predictions and residuals, and models of residuals. Proposed remediation approaches include alternate models, editing of deployable model artifacts, missing value injection, prediction assertions, and regularization methods.
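Two of the debugging techniques listed, partial dependence and individual conditional expectation (ICE), can be sketched together, since a PD curve is just the average of the ICE curves. The `model_predict` function below is a hypothetical stand-in; any `predict(X) -> array` function would work.

```python
import numpy as np

def model_predict(X):
    # Hypothetical fitted model: a simple nonlinear function of two features.
    return X[:, 0] ** 2 + 0.5 * X[:, 1]

def partial_dependence(predict, X, feature, grid):
    """For each grid value, force `feature` to it for every row and predict."""
    ice = np.empty((len(grid), len(X)))  # one ICE curve per row of X
    for i, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = v
        ice[i] = predict(X_mod)
    return ice.mean(axis=1), ice  # PD curve is the mean over ICE curves

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
grid = np.linspace(-2, 2, 5)
pd_curve, ice = partial_dependence(model_predict, X, feature=0, grid=grid)
```

For debugging, the spread of the ICE curves around the PD curve is the interesting part: curves that cross or fan out signal interactions that the averaged PD plot alone would hide.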
Bio: Patrick Hall is the Senior Director of Product at H2O.ai where he focuses mainly on model interpretability. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Prior to joining H2O.ai, Patrick held global customer-facing roles and research and development roles at SAS Institute.
While machine learning projects may seem similar to any software engineering endeavor, the reality is that they are onerous, demand high-quality work from every person involved, and are sensitive to even tiny mistakes.
It seems that we cannot go five years without having some massive technology shift that becomes an essential part of our day-to-day lives. So, we will start with a proper definition of machine learning and how it is changing the way businesses analyze information. We will then continue by discussing proper ways to begin machine learning projects, including weighing the feasibility of a project, planning timelines, and the stages of the machine learning workflow once you start your project.
After exploring the stages of the machine learning workflow, we will end the webinar with an example of a completed machine learning project. We will demonstrate how to create a similar project and give you the tools to create your own.
What you'll learn:
A deeper understanding of the end-to-end machine learning workflow.
The tools needed to effectively create, design, and manage machine learning projects.
The skills to define your goal, foresee issues, release models, and measure outcomes during the ML project lifecycle.
Demo: Skyl Platform for the end-to-end machine learning workflow.
This is the slide deck for this webinar:
https://skyl.ai/webinars/guide-end-to-end-machine-learning-projects
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer... (Madhav Mishra)
The document discusses machine learning paradigms including supervised learning, unsupervised learning, clustering, artificial neural networks, and more. It then discusses how supervised machine learning works using labeled training data for tasks like classification and regression. Unsupervised learning is described as using unlabeled data to find patterns and group data. Semi-supervised learning uses some labeled and some unlabeled data. Reinforcement learning provides rewards or punishments to achieve goals. Inductive learning infers functions from examples to make predictions for new examples.
This document discusses the unrealized power of data and predictive analytics. It begins by highlighting how predictive analytics can be used for forecasting, targeting customers, fraud detection, risk assessment, customer churn prediction, and price elasticity analysis. It then provides examples of predictive analytics in action in various industries like healthcare, education, law enforcement, and human resources. The document emphasizes that predictive analytics must become simpler to use and be integrated into business processes. It outlines the data science process and importance of data wrangling. Finally, it discusses Microsoft's CloudML Studio and Data Lab products for building predictive models using machine learning algorithms and analyzing customer data to predict things like equipment failures and customer churn.
This presentation covers an overview of analytics and machine learning. It also covers Microsoft's contribution to the machine learning space: Azure ML Studio, a SaaS-based portal to create, experiment with, and share machine learning solutions with the external world.
To download please go to: http://www.intelligentmining.com/knowledge-base.html
Slides as presented by Alex Lin to the NYC Predictive Analytics Meetup group: http://www.meetup.com/NYC-Predictive-Analytics/ on April 1, 2010 (no joke!) :)
This document discusses various aspects of data labeling for machine learning tasks. It addresses how people apply labels to communicate concepts and how labeling is a learned behavior that varies by individual. The document then discusses using non-expert taggers for simpler labeling tasks and expert taggers for more specialized domains. It emphasizes the importance of data quality and reliability for machine learning and describes various metrics like Cohen's Kappa and Krippendorff's Alpha for measuring inter-rater agreement between taggers. The document also discusses approaches for handling multiple taggers, like majority vote and Expectation Maximization, which can correct for individual tagger biases.
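Cohen's Kappa, one of the agreement metrics mentioned, can be computed directly: it is the observed agreement between two taggers, corrected for the agreement expected by chance. The two annotation lists below are hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Inter-rater agreement between two taggers, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items where both taggers agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement under independence: product of marginal label rates.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations from two taggers over ten items.
a = ["cat", "cat", "dog", "cat", "dog", "dog", "cat", "dog", "cat", "cat"]
b = ["cat", "cat", "dog", "dog", "dog", "dog", "cat", "dog", "cat", "dog"]
print(round(cohens_kappa(a, b), 3))  # prints 0.615
```

Here the taggers agree on 8 of 10 items (0.8 raw agreement), but because chance agreement is 0.48, the kappa of about 0.62 tells a more honest story about reliability than raw agreement does.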
This document discusses streaming data processing and the adoption of scalable frameworks and platforms for handling streaming or near real-time analysis and processing over the next few years. These platforms will be driven by the needs of large-scale location-aware mobile, social and sensor applications, similar to how Hadoop emerged from large-scale web applications. The document also references forecasts of over 50 billion intelligent devices by 2015 and 275 exabytes of data per day being sent across the internet by 2020, indicating challenges around data of extreme size and the need for rapid processing.
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E... (DATAVERSITY)
Many data scientists are well grounded in creating accomplishments in the enterprise, but many come from outside it - from academia, from PhD programs, and from research. They have the necessary technical skills, but those don't count until their product gets to production and into use. The speaker recently helped a struggling data scientist understand his organization and how to create success in it. That turned into this presentation, because many new data scientists struggle with the complexities of an enterprise.
Operationalize analytics through modern data strategy (Nagarro)
This document discusses the need for companies to operationalize analytics through a modern data strategy. It outlines key drivers of innovation like customers, competitors and regulators that necessitate such a strategy. It then discusses challenges of existing systems related to data volume, structure and regulations. The document proposes a modern data architecture with three pillars - people, process and technology. It provides an example framework for an enterprise data strategy and references Nagarro's capabilities in big data and analytics.
Practical Explainable AI: How to build trustworthy, transparent and unbiased ...Raheel Ahmad
This presentation is from the Federated & Distributed Machine Learning Conference. This talk focuses on why we need explainable AI and how can we build models that are trustworthy, transparency and unbiased.
This was presented at the London Artificial Intelligence & Deep Learning Meetup.
https://www.meetup.com/London-Artificial-Intelligence-Deep-Learning/events/245251725/
Enjoy the recording: https://youtu.be/CY3t11vuuOM.
- - -
Kasia discussed complexities of interpreting black-box algorithms and how these may affect some industries. She presented the most popular methods of interpreting Machine Learning classifiers, for example, feature importance or partial dependence plots and Bayesian networks. Finally, she introduced Local Interpretable Model-Agnostic Explanations (LIME) framework for explaining predictions of black-box learners – including text- and image-based models - using breast cancer data as a specific case scenario.
Kasia Kulma is a Data Scientist at Aviva with a soft spot for R. She obtained a PhD (Uppsala University, Sweden) in evolutionary biology in 2013 and has been working on all things data ever since. For example, she has built recommender systems, customer segmentations, predictive models and now she is leading an NLP project at the UK’s leading insurer. In spare time she tries to relax by hiking & camping, but if that doesn’t work ;) she co-organizes R-Ladies meetups and writes a data science blog R-tastic (https://kkulma.github.io/).
https://www.linkedin.com/in/kasia-kulma-phd-7695b923/
Artificial Intelligence is increasingly playing an integral role in determining our day-to-day experiences. Moreover, with proliferation of AI based solutions in areas such as hiring, lending, criminal justice, healthcare, and education, the resulting personal and professional implications of AI are far-reaching. The dominant role played by AI models in these domains has led to a growing concern regarding potential bias in these models, and a demand for model transparency and interpretability. In addition, model explainability is a prerequisite for building trust and adoption of AI systems in high stakes domains requiring reliability and safety such as healthcare and automated transportation, and critical industrial applications with significant economic implications such as predictive maintenance, exploration of natural resources, and climate change modeling.
As a consequence, AI researchers and practitioners have focused their attention on explainable AI to help them better trust and understand models at scale. The challenges for the research community include (i) defining model explainability, (ii) formulating explainability tasks for understanding model behavior and developing solutions for these tasks, and finally (iii) designing measures for evaluating the performance of models in explainability tasks.
In this tutorial, we present an overview of model interpretability and explainability in AI, key regulations / laws, and techniques / tools for providing explainability as part of AI/ML systems. Then, we focus on the application of explainability techniques in industry, wherein we present practical challenges / guidelines for effectively using explainability techniques and lessons learned from deploying explainable models for several web-scale machine learning and data mining applications. We present case studies across different companies, spanning application domains such as search & recommendation systems, sales, lending, and fraud detection. Finally, based on our experiences in industry, we identify open problems and research directions for the data mining / machine learning community.
The document discusses explainability and bias in machine learning/AI models. It covers several topics:
1. Why explainability of models is important, including for laypeople using models and potential legal needs for explanations of decisions.
2. Methods for explainability including using interpretable models directly and post-hoc explainability methods like LIME and SHAP which provide feature attributions.
3. Issues with bias in machine learning models and different definitions of fairness. It also discusses techniques for measuring and mitigating bias, such as reweighting data or using adversarial learning.
The document discusses predictive analytics techniques including data preparation, modeling, and model monitoring. It describes preparing data through transformation, deriving behavioral variables, and quality checks. Modeling techniques covered include decision trees, regression, neural networks, and ensemble modeling in SAS Enterprise Miner or other software. Model monitoring compares actual and predicted values, analyzes variable distributions in scored data, and monitors model performance metrics.
This is an introduction to text analytics for advanced business users and IT professionals with limited programming expertise. The presentation will go through different areas of text analytics as well as provide some real work examples that help to make the subject matter a little more relatable. We will cover topics like search engine building, categorization (supervised and unsupervised), clustering, NLP, and social media analysis.
This document provides an overview of AlgoAnalytics, an analytics consultancy company that uses advanced machine learning techniques. The summary is as follows:
(1) AlgoAnalytics provides predictive analytics solutions for retail, healthcare, financial services, and other industries using techniques like deep learning, natural language processing, and computer vision on structured, text, image and sound data.
(2) The CEO and founder, Aniruddha Pant, has over 20 years of experience applying mathematical techniques to business problems. Some of AlgoAnalytics' work includes recommender systems, demand prediction, image analysis, and customer churn prevention for online retail.
(3) Examples of AlgoAnalytics' predictive models shown include an
Text analytics is used to extract structured data from unstructured text sources like social media posts, reviews, emails and call center notes. It involves acquiring and preparing text data, processing and analyzing it using algorithms like decision trees, naive bayes, support vector machines and k-nearest neighbors to extract terms, entities, concepts and sentiment. The results are then visualized to support data-driven decision making for applications like measuring customer opinions and providing search capabilities. Popular tools for text analytics include RapidMiner, KNIME, SPSS and R.
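As a concrete illustration of the naive Bayes step above, here is a from-scratch multinomial naive Bayes over bag-of-words terms with Laplace smoothing; the tiny labeled corpus is invented for the example:

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus (hypothetical call-center snippets) for sentiment.
docs = [
    ("great service fast delivery", "pos"),
    ("love the helpful support team", "pos"),
    ("terrible slow service", "neg"),
    ("awful support never again", "neg"),
]

# Train: per-label term frequencies and label priors.
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in docs:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        # log P(label) + sum of log P(word | label), Laplace-smoothed
        score = math.log(label_counts[label] / len(docs))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("slow awful delivery"))  # → neg
```

Real pipelines in RapidMiner, KNIME, SPSS or R add tokenization, stemming and stop-word removal before this scoring step, but the core classifier is the same.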
Data Science - Part I - Sustaining Predictive Analytics Capabilities (Derek Kane)
This is the first lecture in a series of data analytics topics, geared to individuals and business professionals who have no background in building modern analytics approaches. This lecture provides an overview of the models and techniques we will address throughout the lecture series; we will discuss Business Intelligence topics, predictive analytics, and big data technologies. Finally, we will walk through a simple yet effective example which showcases the potential of predictive analytics in a business context.
How ML can improve purchase conversions (Sudeep Shukla)
- What is Machine Learning and what problems can it solve?
- Basic Machine Learning models
- Data gathering and data cleaning
- Parameters for judging whether the model is performing well?
- Making it easy for sales & marketing teams to use the ML program
As AI becomes more and more prevalent in our lives, the decisions it makes for us are becoming more and more impactful on our lives and those of others.
How can we help people trust the models we're building? The field of Explainable AI focuses on making any machine learning model interpretable by non-experts.
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYC (Sri Ambati)
This talk was recorded in NYC on October 22nd, 2019 and can be viewed here: https://youtu.be/ZGSEDv8hqHY
The Case for Model Debugging
Prediction by machine learning models is fundamentally the execution of computer code. Like all good code, machine learning models should be debugged for logical or runtime errors or for security vulnerabilities. Recent, high-profile failures have made it clear that machine learning models must also be debugged for disparate impact across demographic segments and other types of sociological bias. Model debugging enhances trust in machine learning directly by increasing accuracy in new or holdout data, by decreasing or identifying hackable attack surfaces, or by decreasing sociological bias. As a side-effect, model debugging should also increase understanding and explainability of model mechanisms and predictions. This presentation outlines several standard and newer model debugging techniques and proposes several potential remediation methods for any discovered bugs. Discussed debugging techniques include adversarial examples, benchmark models, partial dependence and individual conditional expectation, random attacks, Shapley explanations of predictions and residuals, and models of residuals. Proposed remediation approaches include alternate models, editing of deployable model artifacts, missing value injection, prediction assertions, and regularization methods.
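One of the listed techniques, building a model of residuals, can be sketched in a few lines: fit a simple model, then use an interpretable surrogate over its residuals to locate where it fails. Everything below (the data, the bin count) is illustrative, not H2O.ai's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

# A deliberately mis-specified setup: fit a line to data that turns
# quadratic for x > 1, so errors concentrate in one region.
x = rng.uniform(-2, 2, 400)
y = x + np.where(x > 1, (x - 1) ** 2 * 5, 0.0) + rng.normal(0, 0.1, 400)

slope, intercept = np.polyfit(x, y, 1)          # the "production" model
residuals = y - (slope * x + intercept)

# Model of residuals: a coarse interpretable surrogate (bin averages)
# that points at the region where the production model is wrong.
bins = np.linspace(-2, 2, 9)
idx = np.digitize(x, bins) - 1
mean_abs_resid = np.array([np.abs(residuals[idx == b]).mean() for b in range(8)])

worst = bins[mean_abs_resid.argmax()]
print(f"model breaks down around x >= {worst:.1f}")
```

In practice the surrogate is usually a shallow decision tree over all features, so the "where it fails" answer comes back as human-readable rules.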
Bio: Patrick Hall is the Senior Director of Product at H2O.ai where he focuses mainly on model interpretability. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Prior to joining H2O.ai, Patrick held global customer-facing roles and research and development roles at SAS Institute.
Machine learning projects may seem similar to any other software engineering endeavor, but in reality they are onerous, demand high-quality work from everyone involved, and are sensitive to even tiny mistakes.
It seems that we cannot go five years without having some massive technology shift that becomes an essential part of our day-to-day lives. So, we will start with a proper definition of machine learning and how it is changing the way businesses analyze information. We will then continue by discussing proper ways to begin machine learning projects, including weighing the feasibility of a project, planning timelines, and the stages of the machine learning workflow once you start your project.
After exploring the stages of the machine learning workflow, we will end the webinar with an example of a completed machine learning project. We will demonstrate how to create a similar project and give you the tools to create your own.
What you'll learn:
A deeper understanding of the end-to-end machine learning workflow.
The tools needed to effectively create, design, and manage machine learning projects.
The skills to define your goal, foresee issues, release models, and measure outcomes during the ML project lifecycle.
Demo: Skyl platform for an end-to-end machine learning workflow.
This is the slide deck for this webinar:
https://skyl.ai/webinars/guide-end-to-end-machine-learning-projects
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer... (Madhav Mishra)
The document discusses machine learning paradigms including supervised learning, unsupervised learning, clustering, artificial neural networks, and more. It then discusses how supervised machine learning works using labeled training data for tasks like classification and regression. Unsupervised learning is described as using unlabeled data to find patterns and group data. Semi-supervised learning uses some labeled and some unlabeled data. Reinforcement learning provides rewards or punishments to achieve goals. Inductive learning infers functions from examples to make predictions for new examples.
This document discusses the unrealized power of data and predictive analytics. It begins by highlighting how predictive analytics can be used for forecasting, targeting customers, fraud detection, risk assessment, customer churn prediction, and price elasticity analysis. It then provides examples of predictive analytics in action in various industries like healthcare, education, law enforcement, and human resources. The document emphasizes that predictive analytics must become simpler to use and be integrated into business processes. It outlines the data science process and importance of data wrangling. Finally, it discusses Microsoft's CloudML Studio and Data Lab products for building predictive models using machine learning algorithms and analyzing customer data to predict things like equipment failures and customer churn.
This presentation covers an overview of Analytics and Machine learning. It also covers the Microsoft's contribution in Machine learning space. Azure ML Studio, a SaaS based portal to create, experiment and share Machine Learning Solutions to the external world.
To download please go to: http://www.intelligentmining.com/knowledge-base.html
Slides as presented by Alex Lin to the NYC Predictive Analytics Meetup group: http://www.meetup.com/NYC-Predictive-Analytics/ on April 1, 2010 (no joke!) :)
This document discusses various aspects of data labeling for machine learning tasks. It addresses how people apply labels to communicate concepts and how labeling is a learned behavior that varies by individual. The document then discusses using non-expert taggers for simpler labeling tasks and expert taggers for more specialized domains. It emphasizes the importance of data quality and reliability for machine learning and describes various metrics like Cohen's Kappa and Krippendorff's Alpha for measuring inter-rater agreement between taggers. The document also discusses approaches for handling multiple taggers, like majority vote and Expectation Maximization, which can correct for individual tagger biases.
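Cohen's Kappa and majority voting, both mentioned above, are short enough to sketch directly; the tagger labels are invented for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two taggers' label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n         # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

def majority_vote(*taggings):
    """Resolve disagreements item by item across any number of taggers."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*taggings)]

tagger1 = ["spam", "spam", "ham", "ham",  "spam", "ham"]
tagger2 = ["spam", "ham",  "ham", "ham",  "spam", "ham"]
tagger3 = ["spam", "spam", "ham", "spam", "spam", "ham"]

print(round(cohens_kappa(tagger1, tagger2), 3))  # → 0.667
print(majority_vote(tagger1, tagger2, tagger3))  # → ['spam', 'spam', 'ham', 'ham', 'spam', 'ham']
```

Kappa of 1.0 means perfect agreement and 0 means no better than chance; Expectation Maximization approaches go further than majority vote by weighting each tagger by their estimated reliability.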
This document discusses streaming data processing and the adoption of scalable frameworks and platforms for handling streaming or near real-time analysis and processing over the next few years. These platforms will be driven by the needs of large-scale location-aware mobile, social and sensor applications, similar to how Hadoop emerged from large-scale web applications. The document also references forecasts of over 50 billion intelligent devices by 2015 and 275 exabytes of data per day being sent across the internet by 2020, indicating challenges around data of extreme size and the need for rapid processing.
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E... (DATAVERSITY)
Many data scientists are well grounded in delivering results in the enterprise, but many come from outside – from academia, PhD programs and research. They have the necessary technical skills, but those don't count until their product gets to production and into use. The speaker recently helped a struggling data scientist understand his organization and how to create success in it. That turned into this presentation, because many new data scientists struggle with the complexities of an enterprise.
Operationalize analytics through modern data strategy (Nagarro)
This document discusses the need for companies to operationalize analytics through a modern data strategy. It outlines key drivers of innovation like customers, competitors and regulators that necessitate such a strategy. It then discusses challenges of existing systems related to data volume, structure and regulations. The document proposes a modern data architecture with three pillars - people, process and technology. It provides an example framework for an enterprise data strategy and references Nagarro's capabilities in big data and analytics.
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented What Data Do You Have and Where is it?
For more information on the services offered by Caserta Concepts, visit out website at http://casertaconcepts.com/.
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans... (Precisely)
Making the hurdle from designing a machine learning model to putting it into production is the key to getting value back, and the roadblock that stops many a promising machine learning project. After the data scientists have done their part, engineering robust production data pipelines has its own set of tough problems to solve. Syncsort software helps the data engineer every step of the way.
Once you’ve got data pulled in from multiple sources, you need to assess the mess. In nearly every data set, there will be flaws. Missing data, misspelled data, misfielded data, dozens of common problems that need to be repaired before the data is ready to use. The data quality software that has been on the market for years is the obvious choice, since it already has the full toolset to assess the problems you’re up against and correct them. Unfortunately, most data quality software was built in the age of single server data warehouses and doesn’t scale to cluster-sized problems. It is also, traditionally, far too slow to support the kind of real-time use cases that drive the machine learning world.
When Syncsort bought Trillium, the industry leader in data quality software for over a decade, we combined Trillium Quality with Intelligent Execution, our artificially intelligent dynamic optimizer that provides excellent performance on MapReduce or Spark. Rather than coding everything from scratch and reinventing the data quality wheel, view this short webinar on-demand to learn how you can feed production machine learning models with shiny clean data while spending zero time on coding and performance tuning. These fifteen minutes could save you weeks.
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for... (Precisely)
The advanced analytics and AI that run today’s businesses rely on a larger volume, and greater variety, of data. This data needs to be of the highest quality to ensure the best possible outcomes, but traditional data quality tools weren’t designed for today’s modern data environments.
That’s why we’ve developed Trillium DQ for Big Data -- an integrated product that delivers industry-leading data profiling and data quality at scale, in the cloud or on premises.
In this on-demand webcast, you will learn how Trillium DQ:
• Empowers data analysts to easily profile large, diverse data sources to discover new insights, uncover issues, and report on their findings – all without involving IT.
• Delivers best-in-class entity resolution to support mission-critical applications such as Customer 360, fraud detection, AML, and predictive analytics.
• Supports Cloud and hybrid architectures by providing consistent high-performance processing within critical time windows on all platforms.
• Keeps enterprise data lakes validated, clean, and trusted with the highest quality data – without technical expertise in big data or distributed architectures.
• Enables data quality monitoring based on targeted business rules for data governance and business insight
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track (Precisely)
With recent studies indicating that 80% of AI and machine learning projects fail due to data quality issues, it's critical to think about the problem holistically. This is not a simple topic – data quality issues can occur anywhere from project kickoff through model implementation and usage.
View this webinar on-demand, where we start with four foundational data steps to get our AI and ML projects grounded and underway, specifically:
• Framing the business problem
• Identifying the “right” data to collect and work with
• Establishing baselines of data quality through data profiling and business rules
• Assessing fitness for purpose for training and evaluating the subsequent models and algorithms
This document discusses balancing data governance and innovation. It describes how traditional data analytics methods can inhibit innovation by requiring lengthy processes to analyze new data. The document advocates adopting a data lake approach using tools like Hadoop and Spark to allow for faster ingestion and analysis of diverse data types. It also discusses challenges around simultaneously enabling innovation through a data lake while still maintaining proper data governance, security, and quality. Achieving this balance is key for organizations to leverage data for competitive advantage.
Achieving a Single View of Business – Critical Data with Master Data Management (DATAVERSITY)
This document discusses achieving a single view of critical business data through master data management (MDM). It outlines how MDM can consolidate data from various internal and external sources to provide a centralized, trusted view across different business domains. The key benefits of MDM include improved data quality, governance and compliance. It also enables contextual insights and more informed decision-making through cross-domain intelligence and analytics. Successful MDM requires flexible technologies, processes and organizational support to ensure data governance and deliver ongoing value.
Introduction to Data Science (Data Summit, 2017) (Caserta)
This document summarizes an introduction to data science presentation by Joe Caserta and Bill Walrond of Caserta Concepts. Caserta Concepts is an internationally recognized data innovation and engineering consulting firm. The agenda covers why data science is important, challenges of working with big data, governing big data, the data pyramid, what data scientists do, standards for data science, and a demonstration of data analysis. Popular machine learning algorithms like regression, decision trees, k-means clustering and collaborative filtering are also discussed.
Data Profiling: The First Step to Big Data Quality (Precisely)
Big data offers the promise of a data-driven business model generating new revenue and competitive advantage fueled by new business insights, AI, and machine learning. Yet without high quality data that provides trust, confidence, and understanding, business leaders continue to rely on gut instinct to drive business decisions.
The critical foundation and first step to deliver high quality data in support of a data-driven view that truly leverages the value of big data is data profiling - a proven capability to analyze the actual data content and help you understand what's really there.
View this webinar on-demand to learn five core concepts to effectively apply data profiling to your big data, assess and communicate the quality issues, and take the first step to big data quality and a data-driven business.
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack (Precisely)
When consolidating multiple sources of information from across your organization, how do you find the records that relate to the same customer, the same company or the same product? This is the challenge faced by many businesses today when putting a data lake to work. The problem is made far worse when different systems may not have the same contact entered the same way. Is Bob Smith the same as Robert Smith? How about Dr. Robert L. Smith - is he the same person? What about Syncsort, Inc and Sinksort Corp.? Are those the same company? One must compare each individual record to every other record in the dataset with some very sophisticated matching algorithms to determine who is who, and you may have to compare the data multiple times in multiple ways to resolve each entity.
Just to add to the difficulty, let’s say your organization has very large volumes of records in your data lake - you don’t have to compare a thousand records to a thousand other records multiple times - you must compare a million to a million, or 100 million to 100 million. This kind of compute intensive comparison can bring even a powerful cluster to its knees.
This is a problem Syncsort customers must solve, and we have developed some very powerful and intelligent software to tackle it.
View this presentation as we discuss the challenges of entity resolution at scale, how Syncsort’s Trillium data quality software line has tackled them successfully in production clusters and see a demonstration of this software in action.
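The matching problem described here can be sketched at toy scale with three ingredients: normalization, a cheap blocking key, and a string-similarity score. Standard-library difflib stands in for Trillium's far more sophisticated matchers, and the records and threshold are illustrative:

```python
from difflib import SequenceMatcher

# Hypothetical customer records from two source systems.
records = ["Bob Smith", "Robert Smith", "Dr. Robert L. Smith", "Alice Jones"]

def normalize(name):
    """Strip titles/initials and lowercase before comparing."""
    drop = {"dr.", "mr.", "mrs.", "inc", "corp"}
    parts = [p for p in name.lower().split() if p not in drop and len(p) > 2]
    return " ".join(parts)

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Blocking: only compare records sharing a cheap key (here, the last token),
# turning one giant O(n^2) comparison into many small per-block comparisons.
blocks = {}
for r in records:
    blocks.setdefault(normalize(r).split()[-1], []).append(r)

for key, group in blocks.items():
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            if similarity(group[i], group[j]) > 0.6:
                print(f"possible match: {group[i]!r} ~ {group[j]!r}")
```

Blocking is what keeps the million-to-million case tractable: records are only compared within blocks, and production systems run several passes with different blocking keys to catch pairs a single key would miss.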
Data Virtualization for Compliance – Creating a Controlled Data Environment (Denodo)
CIT modernized its data architecture in response to intense regulatory scrutiny. In this presentation, they present how data virtualization is being used to drive standardization, enable cross-company data integration, and serve as a common provisioning point from which to access all authoritative sources of data.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/CCqUeT.
This document discusses trends in data analytics. It begins by defining big data and how it differs from traditional data approaches in terms of size, techniques, and ability to solve new problems. It then provides examples of big data applications across various industries like retail, automotive, healthcare, and insurance. Specifically, it outlines how big data is used for predictive analytics, personalization, fraud detection, and risk adjustment. Finally, it discusses some risks of big data like privacy issues and ensuring the right problems are addressed.
This document discusses big data and the importance of data quality for big data initiatives. It defines big data as large, diverse digital data sets that require new techniques to enable capture, storage, analysis and visualization. The key challenges of big data include integrating diverse structured and unstructured data sources and ensuring high quality data. The document emphasizes that poor data quality can undermine big data analytics efforts and lead to wrong insights. It promotes establishing a data quality framework including profiling, standardization, matching and enrichment to enable valid big data analytics.
This document discusses the opportunities and challenges of big data. It defines big data as huge volumes of structured and unstructured data from various sources that require new tools to analyze and extract business insights. Big data provides both statistical and predictive views to help businesses make smarter decisions. While big data allows companies to integrate diverse data sources and gain real-time insights, challenges include processing large and complex data volumes and ensuring data quality, privacy and management. The document outlines the big data lifecycle and how analytics can be used descriptively, predictively and prescriptively.
The presentation includes an introduction to the topic, the various dimensions of big data, its evolution from big data 1.0 to big data 3.0, and its impact on various industries, its uses, and the challenges it faces. The concluding slide gives a brief view of the future of big data.
Building Your Enterprise Data Marketplace with DMX-h (Precisely)
In the past few years third-party data marketplaces, often provided as Data as a Service, have taken off. But most organizations already own the data most relevant to their business – data pertaining to their own customers, transactions, products, etc.
That’s why the most successful organizations are applying the concepts of external data markets to create their own enterprise data marketplaces, where users can easily find and access data from across the company that is clean, trustworthy and auditable.
View this webinar on-demand to learn how to build an enterprise data marketplace of your own with DMX-h! We'll cover:
• Attributes of a successful enterprise data marketplace
• Potential roadblocks, and how to overcome them
• Examples of customers who have successfully built data marketplaces with DMX-h
The New Trillium DQ: Big Data Insights When and Where You Need Them (Precisely)
Organizations are increasingly challenged to deliver on new initiatives with more data sources and higher volumes of data across divergent, hybrid architectures. With this enterprise challenge in mind, Syncsort introduces Trillium DQ version 16, bringing the full range of data quality functionality into a highly scalable, natively executed framework. It works on both traditional and distributed platforms to ensure consistent processing while achieving the performance today's workloads and data volumes demand.
This webcast highlights the capabilities of Trillium DQ v16 with a focus on its highly scalable, distributed architecture.
View this webinar on-demand to learn:
• How Trillium Discovery provides easy-to-use insight into Big Data, relational, and text-based data sources for rapid understanding of your data sources
• How Trillium Quality delivers high-scale, high-performance execution for critical data quality processes including global data enrichment and multi-domain entity resolution
ADV Slides: Data Pipelines in the Enterprise and Comparison (DATAVERSITY)
Despite the many, varied, and legitimate data platforms that exist today, data seldom lands once in its perfect spot for the long haul of usage. Data is continually on the move in an enterprise into new platforms, new applications, new algorithms, and new users. The need for data integration in the enterprise is at an all-time high.
Solutions that meet these criteria are often called data pipelines. These are designed to be used by business users, in addition to technology specialists, for rapid turnaround and agile needs. The field is often referred to as self-service data integration.
Although the stepwise Extraction-Transformation-Loading (ETL) remains a valid approach to integration, ELT, which uses the power of the database processes for transformation, is usually the preferred approach. The approach can often be schema-less and is frequently supported by the fast Apache Spark back-end engine, or something similar.
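The ELT pattern described above – transformation pushed into the target engine's SQL rather than a separate ETL step – can be sketched with SQLite standing in for Spark SQL or a warehouse engine; the table names and cleansing rules are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 1. Extract + Load: raw records land untouched, mess and all (schema-less
#    in spirit: no types declared, nothing rejected at load time).
con.execute("CREATE TABLE raw_orders (id, customer, amount_text)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "  Alice ", "10.50"), (2, "BOB", "3.25"), (3, "bob", "not-a-number")],
)

# 2. Transform in-database: cleansing rules expressed as SQL, executed by
#    the target engine rather than a separate transformation server.
con.execute("""
    CREATE TABLE orders AS
    SELECT id,
           LOWER(TRIM(customer))     AS customer,
           CAST(amount_text AS REAL) AS amount
    FROM raw_orders
    WHERE amount_text GLOB '[0-9]*'  -- quarantine unparseable amounts
""")

for row in con.execute("SELECT customer, amount FROM orders ORDER BY id"):
    print(row)  # → ('alice', 10.5) then ('bob', 3.25)
```

The design point is that the raw table survives: when a cleansing rule changes, the transform is simply re-run against the already-loaded data instead of re-extracting from the source.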
In this session, we look at the major data pipeline platforms. Data pipelines are well worth exploring for any enterprise data integration need, especially where your source and target are supported, and transformations are not required in the pipeline.
The document discusses avoiding compliance pitfalls related to anti-money laundering (AML) regulations. It recommends establishing a data management system, conducting organizational screening against legislations, and reporting suspicious activity. It also warns of primary failings seen in identifying customer details, conducting sanctions screening, and failing to report qualifying transactions, which can damage an organization's reputation. Effective information management that supports AML and fraud prevention includes having data governance, quality, and master data management practices.
Studies show that poor data quality has a negative impact on customer experience, analytics and marketing.
This presentation discusses solutions to the problem of poor customer data quality.
Get the survey results http://www.masterdata.co.za/index.php/whitepapers/file/77-whitepaper-extracting-marketing-value-from-big-data
Moving from passive to active data governance (Gary Allemann)
Presentation given at the 2015 South African Data Management Association conference.
Check out blog.masterdata.co.za for articles related to this presentation - coming over the next few weeks - or call us on +27114854856 for more information
Using GIS to enhance customer experience (Gary Allemann)
The document discusses how geographic information systems (GIS) can enhance the customer experience. It provides examples of how telecommunications, insurance, and government organizations have used big data and GIS analytics to gain insights into customer interactions and preferences. This allows them to better meet customer needs, detect fraud, optimize networks in high usage areas, and improve delivery of social services. Location data, transaction records, and demographics are analyzed to understand customers and identify discrepancies or coverage gaps to target for improvement.
Data is becoming increasingly important for powering operations, nurturing decisions, and sustaining competitive advantage. As data becomes more central to business success, organizations require a chief data officer to ensure data meets business needs and is properly governed, moving from a paradigm of "data first" to encourage data discovery while maintaining business control over analytics. A chief data officer may become responsible for overseeing big data strategy and management as the role of data grows in importance.
Big data myths are busted in this document which outlines common misconceptions about big data and provides guidance on where to start with big data initiatives. Some myths that are dispelled are that big data is only about external data, size, specific technologies like Hadoop, and that it will solve all data quality problems. The document recommends taking an agile analytics approach starting with identifying use cases then integrating, preparing, analyzing and visualizing data to deploy solutions in under 4 weeks.
What is the value of data? Data governance must look beyond master data to deliver real value.
Visit www.masterdata.co.za/index.php/data-governance-solutions
Bridging the gap between relational and spatial data
How data quality links customer data to spatial data sets - see http://www.masterdata.co.za/index.php/geocoding-cres
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
5. Data delivers competitive advantage
• "Compared with their peers, high performers report a greater variety of actions to monetize data – with greater revenue impact" - McKinsey Global Survey: Fueling growth through data monetization
• 73.2% - Percentage of executives whose firms have achieved measurable results from Big Data and AI investments - NewVantage Partners Big Data Executive Survey 2018
• $1.8 Trillion - Projected annual revenue for insights-driven businesses by 2021 - "Insights-Driven Businesses Set the Pace for Global Growth," Forrester, October 19, 2018
• 85% - Firms that leverage customer behavioral insights outperform peers by 85 percent in sales growth and 25 percent in gross margin - McKinsey Global Survey: Capturing value from your customer data
6. Common machine learning applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
7. Why do you have a data lake?
Syncsort 2019 data trends survey
Analytics use cases drive data lakes and enterprise data hubs
8. Most organizations not getting full value
Syncsort 2019 data trends survey
91% of organizations have not yet reached a “transformational” level of maturity in data and analytics
- Gartner
68% of IT professionals state that data silos negatively impact their organization’s ability to get value from their data
• Every part of the business demands sophisticated data analysis
• Departments need access to the company’s many data sets, combined in different ways
• IT can’t be a bottleneck
• Data has outgrown the data warehouse
• Data lakes can be polluted and chaotic
• Data is inconsistent across data marts
9. Key challenges
Syncsort 2019 data trends survey
Only 9% “very effective” in getting value from data
IT decision makers waste 2 hours daily looking for relevant data
10. Three-pronged approach
Make data easier to find and understand
• Data governance
• Data catalog
Flexible data pipelines
• Batch and streaming
• Legacy, big data and cloud
Debug your data
• Manage bias
• Manage data quality at scale
• Governance / traceability
11. Data Architecture
[Architecture diagram spanning business-driven and IT-driven layers: Business Intelligence, Data Warehouse, Big Data, AI and ML on top, supported by Data Integration, MDM/Reference Data, Data Quality, Data Governance, Metadata/Data Modelling, and Data Security]
13. Data Governance and Catalog
AI, Big Data, and Data Governance // Stan Christiaens, Collibra (FirstMark's Data Driven)
14. Data Governance and Catalog
AI, Big Data, and Data Governance // Stan Christiaens, Collibra (FirstMark's Data Driven)
• The differentiator for #AI is DATA
• Bias is like “a snake in the data grass”
• Finding data is a “people and process” problem
• Data (if you treat it as a strategic asset) should have its own business process
16. Data Scientist
• Expert in statistical analysis, machine learning techniques, and finding answers to business questions buried in datasets.
• Does NOT want to spend 50–90% of their time tinkering with data, getting it into good shape to train models – but frequently does, especially if there’s no data engineer on their team.
• When a machine learning model is trained, tested, and proven to accomplish the goal, turns it over to a data engineer to productionize. Not skilled at taking the model from a test sandbox into production, especially not at large scale.
Data Engineer
• Expert in data structures, data manipulation, and constructing production data pipelines.
• WANTS to spend all of their time working with data, but usually has more on their plate than they can keep up with. Anything that will speed up their work is helpful.
• In most successful companies, is involved from the beginning. First gathers, cleanses and standardizes data, helps the data scientist with feature engineering, and provides top-notch data ready to train models.
• After the model is tested, builds robust, high-scale data pipelines to feed the models the data they need, in the correct format, in production, to provide ongoing business value.
Data Engineer to the rescue
17. Identify and onboard all relevant data
Data Lake or Cloud – Raw Landing Zone
Access & Onboard – elect to include data to understand
• What you don’t know CAN hurt you – e.g. bias
• If you’ve left it out, you cannot know it exists
• Data sets have more power to predict when combined
18. Ensure the quality
Data Lake or Cloud – Raw Landing Zone → Refined Zone
Refine – cleanse, enrich, de-duplicate
• What data needs refinement? – use cases will determine
• Each data set should be refined once – don’t repeat work
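The refine step (cleanse, standardize, de-duplicate) can be sketched in a few lines of plain Python. The field names and cleansing rules below are illustrative assumptions, not from the deck:

```python
# Minimal sketch of a refine step: cleanse, standardize, de-duplicate.
def refine(records):
    seen, refined = set(), []
    for rec in records:
        # Cleanse: trim whitespace, normalize casing.
        clean = {k: v.strip().lower() for k, v in rec.items()}
        # De-duplicate on a natural key (here: name + email).
        key = (clean["name"], clean["email"])
        if key not in seen:
            seen.add(key)
            refined.append(clean)
    return refined

raw = [
    {"name": "ACME Corp ", "email": "SALES@ACME.COM"},
    {"name": "acme corp",  "email": "sales@acme.com"},
    {"name": "Widget Ltd", "email": "info@widget.co"},
]
print(len(refine(raw)))  # 2 – the two ACME records collapse into one
```

Because the refined output is deterministic, the same routine can run once per data set and be reused by every downstream use case, as the slide advises.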
19. Understand provenance
Data Lake or Cloud – Raw Landing Zone → Refined Zone
Track Provenance
• Data lineage documentation is necessary to establish that data can be trusted, and for auditing and regulatory compliance
• Also useful for reproducing steps in production machine learning data pipelines
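One way to make lineage reproducible is to record a metadata entry for every transformation as it runs. This is a minimal sketch, with hypothetical step and source names:

```python
# Sketch: record data lineage alongside each pipeline step so refined
# outputs can be audited and the steps replayed in production.
import time

lineage = []

def step(name, source, func, data):
    """Apply a transformation and append a lineage entry describing it."""
    out = func(data)
    lineage.append({
        "step": name,
        "source": source,
        "rows_in": len(data),
        "rows_out": len(out),
        "at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })
    return out

rows = [" A ", "b", "b"]
rows = step("trim", "raw_landing_zone", lambda d: [r.strip() for r in d], rows)
rows = step("dedupe", "trim", lambda d: sorted(set(d)), rows)
print([e["step"] for e in lineage])  # ['trim', 'dedupe']
```

An auditor can read the lineage list end-to-end, and an operator can replay the same ordered steps against production data.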
20. Enrich and grow
Data Lake or Cloud – Raw Landing Zone → Refined Zone
Shop for data sets and features, and validate against your questions
• The analyst or data scientist shops for data: what do I need for my purpose?
• Quality is already assured, provenance documented
• Improves trust, saves time
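"Shopping" for data works when each catalog entry carries its quality and provenance status. A toy sketch of such a catalog lookup, with hypothetical entry names and flags:

```python
# Sketch of a data catalog where consumers can filter for trusted
# (quality-checked, lineage-documented) data sets by tag.
catalog = [
    {"name": "customers_refined", "quality_checked": True,
     "lineage": True, "tags": {"crm", "pii"}},
    {"name": "weblogs_raw", "quality_checked": False,
     "lineage": False, "tags": {"clickstream"}},
]

def shop(tags, require_trusted=True):
    """Return data set names matching any requested tag."""
    return [
        e["name"] for e in catalog
        if tags & e["tags"]
        and (not require_trusted or (e["quality_checked"] and e["lineage"]))
    ]

print(shop({"crm"}))  # ['customers_refined']
```

With `require_trusted=True` as the default, raw unvetted sets stay out of analysts' carts unless explicitly requested.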
21. Challenges of Engineering Modern Data Pipelines
1. Scattered and Difficult-to-Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS, web clicks, etc., all in incompatible formats, making it difficult to gather and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific entity (person, company, product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power. Essentially everything has to be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, in order for models to accurately make predictions on new data, and for required audit trails. Capture of complete lineage, from source to end point, is needed.
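The flavor of multi-field entity matching can be sketched with Python's standard-library `difflib`. The field weights and threshold are illustrative assumptions; production matchers use many more fields, phonetic encodings, and blocking to avoid comparing everything to everything:

```python
# Sketch of multi-field entity resolution: score name and address
# similarity, weight the fields, and flag likely matches.
from difflib import SequenceMatcher

def sim(a, b):
    """Character-level similarity ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(r1, r2, threshold=0.85):
    # Weight name higher than address – an assumed, tunable choice.
    score = 0.6 * sim(r1["name"], r2["name"]) + 0.4 * sim(r1["addr"], r2["addr"])
    return score >= threshold

a = {"name": "Acme Corp", "addr": "12 Main St"}
b = {"name": "ACME Corp.", "addr": "12 Main Street"}
print(same_entity(a, b))  # True – minor formatting differences still match
```

The quadratic "compare everything to everything" cost the slide mentions is exactly why real systems first block records into small candidate groups (e.g. by postal code) before scoring pairs.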
22. Onboard any data
Access data from streaming and batch sources outside the cluster. Onboard data, modifying it on-the-fly to match cloud storage models, or store it unchanged for archive compliance.
Data Sources → Data Lake
24. Hybrid and Multi-Cloud Strategies
• Ensure seamless data flow to/from cloud, and among clouds
• Maximize choice for workload optimization and interoperability
• Design once, deploy anywhere – on premise and in the cloud
• Optimize cloud infrastructure for cost and efficiency
• Minimize disruption and risk
• Build new skills to handle different and emerging portfolios
Challenges
• Managing multiple clouds and vendors
• Integrating data and applications on-premise to cloud, across clouds
• Avoiding cloud lock-in
• Lack of skills to handle a hybrid multi-cloud world
Context
• Cloud native or cloud first for new applications
• Scalability and elasticity
• Hybrid: on-premises systems and public and private clouds
• Multi-cloud
• Cloud increases focus on business process over tech details
25. Seamlessly flow data to, from and among clouds
Design Once, Deploy Anywhere – public cloud, private cloud, multi-cloud, hybrid or on-prem
• Build a modern data pipeline with flexibility, agility and elasticity
• Simplify accessing, integrating and governing your data in a single software environment
• Get the most from the cloud – no silos, no lock-in, no re-work
• Move from on-premise to cloud, or between clouds, with no re-design, no re-compile, no re-work – ever
• Get excellent performance every time – without tuning, load balancing, etc.
• Future-proof your applications
26. How dirty data hampers AI (Dimensional Research)
• Cleanse, enrich, de-duplicate
• What data needs refinement? – use cases will determine
• Matching across massive datasets that indicate a single specific entity (person, company, product, etc.)
27. The Modern Data Pipeline Needs Data Quality
• Only 35% of senior executives have a high level of trust in the accuracy of their Big Data analytics*
• 92% of executives are concerned about the negative impact of data and analytics on corporate reputation*
• The cost of poor data quality rose by 50% in 2017 (Gartner)
• 84% of CEOs are concerned about the quality of the data they’re basing decisions on*
• Decision making – trust the data that drives your business
• Machine learning & AI – train your models on accurate data
• Customer centricity – get a single, complete and accurate view of the customer for better sales, marketing and service
• Compliance – know your data, and ensure its accuracy to meet industry and government regulations
*http://kpmg.com/guardiansoftrust
28. Common Data Quality Problems
• Many data records with different layouts
• Lack of standardization of the different fields
• Misspellings
• Data sourced from third parties does not contain all the necessary fields
• Inconsistent data formats (measurements, languages, postal conventions and dates)
• Names spelled differently
• Different number formatting
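Two of these problems, inconsistent date formats and different number formatting, are easy to illustrate. A minimal normalization sketch; the list of accepted formats is an illustrative assumption:

```python
# Sketch: normalize inconsistent dates and numbers to one canonical form.
from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]  # assumed inputs

def normalize_date(value):
    """Return an ISO 8601 date string for any recognized input format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def normalize_number(value):
    """Parse both '1,234.56' (US) and '1.234,56' (European) styles."""
    v = value.strip()
    if "," in v and "." in v and v.rindex(",") > v.rindex("."):
        v = v.replace(".", "").replace(",", ".")  # European style
    else:
        v = v.replace(",", "")                    # US thousands separators
    return float(v)

print(normalize_date("Mar 05, 2019"))  # 2019-03-05
print(normalize_number("1.234,56"))    # 1234.56
```

Raising on unrecognized formats, rather than guessing, keeps bad records visible instead of silently corrupting downstream training data.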
29. Common Data Quality Problems at Scale
Common Challenges
• Big Data projects require massive scalability, low latency, and many data sources for a complete view
• Data quality processing using a standalone server can’t keep up – millions of business transactions a day are now common
• Standalone quality jobs may take several hours; unlikely to meet end-user SLAs and/or key success factors
Solution
• Trillium Quality for Big Data lets you leverage the power and scalability of Big Data frameworks like Spark and MapReduce
• Performs data quality jobs natively on the cluster
• Leverages Intelligent Execution – design once, deploy anywhere: cloud, multi-cloud, hybrid or on-prem
• No need to move/copy data for quality processing; Big Data remains in place
• No coding or tuning; jobs are automatically optimized
Benefits
• Data pipeline delivers trusted data for analytics
• Robust data quality processing at Big Data scale to meet SLAs and support use cases like Anti-Money Laundering or Customer 360
• No coding or tuning saves time and resources – and helps address Big Data skills shortages
• Save time and network resources by keeping data in place
30. Cleanse data in Hadoop / Cloud
Access data from streaming and batch sources outside the cluster. Onboard data, modifying it on-the-fly to match cloud storage models, or store it unchanged for archive compliance. Transform, join, cleanse and enhance data in the cluster with Spark or MapReduce – excellent performance every time.
Data Sources → Data Lake
31. Get end-to-end data lineage
Data Sources → Data Lake
• Access data from streaming and batch sources outside the cluster; onboard data, modifying it on-the-fly to match cloud storage models, or store it unchanged for archive compliance.
• Transform, join, cleanse and enhance data in the cluster with Spark or MapReduce – excellent performance every time.
• Pass source-to-cluster data lineage info to a REST API and to Navigator or Atlas.
• Navigator or Atlas gathers any other changes made to data on the cluster, including changes separately made by MapReduce, Spark or HiveQL.
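A lineage event passed to such a REST API might be packaged like this. The field names and structure are assumptions for illustration; the actual Navigator and Atlas APIs define their own schemas, and the HTTP POST itself is omitted:

```python
# Sketch: package source-to-cluster lineage info as a JSON event that a
# lineage REST endpoint (feeding a tool like Navigator or Atlas) could accept.
import json

def lineage_event(job, inputs, outputs, engine):
    """Build one lineage record: which job read what and wrote what."""
    return {
        "job": job,
        "inputs": inputs,    # upstream data sets read by the job
        "outputs": outputs,  # data sets written to the cluster
        "engine": engine,    # e.g. "spark" or "mapreduce"
    }

event = lineage_event(
    job="refine_customers",
    inputs=["mainframe://claims/raw"],      # hypothetical source URI
    outputs=["hdfs://lake/refined/customers"],
    engine="spark",
)
payload = json.dumps(event)  # this string would be POSTed to the REST API
print("refine_customers" in payload)  # True
```

Emitting one such event per job run gives auditors the source-to-cluster trail the slide describes, with in-cluster changes captured separately by the governance tool.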
33. Analysts Get Complete Picture with Trusted Data Provenance
Data Sources → Data Lake → Analytics, Visualization, Machine Learning
• Access data from streaming and batch sources outside the cluster; onboard data, modifying it on-the-fly to match cloud storage models, or store it unchanged for archive compliance.
• Transform, join, cleanse and enhance data in the cluster with Spark or MapReduce – excellent performance every time.
• Pass source-to-cluster data lineage info to a REST API and to Navigator or Atlas; Navigator or Atlas gathers any other changes made to data on the cluster, including changes separately made by MapReduce, Spark or HiveQL.
• Analytics, visualizations, and machine learning algorithms get clean, complete data.
• Auditors get end-to-end data lineage.
34. Forrester Research
The path to enterprise AI is full of twists and turns, false starts, and lessons to learn. Without data quality, AI and other advanced technologies cannot live up to their expectations.
The Refined Zone may be another cluster, another part of the same cluster, a Cloud, an analytic database, wherever the data sets can be easily stored and found by the people who need them. Select data sets based on use cases. Start with a use case that requires relatively few data sets and/or has relatively high business value. Get immediate ROI for that use case, then move to the next. Once a data set has been refined, it’s there for other use cases that might need the same data. Build on that by refining additional data sets for the next use case. And so on.
That’s a data marketplace, and why you need one.
IT is transforming to handle a combination of on-premise, infrastructure-as-a-service, platform-as-a-service, and software-as-a-service. The best architecture keeps choices affordable, so that an architecture with multiple cloud vendors is just as easy and powerful as using a single cloud. Going all-in on one cloud puts IT in the same weak, single-source position that many customers of companies such as Oracle find themselves in today. No matter what the current management of those vendors says, future managers will exploit this weakness to increase revenue.
It is crucial that you do as much of the detailed work of handling complex programming, rules, transformations, and other forms of coding in ways that protect you from changes in the underlying infrastructure. The ideal form of expression of coding is in a system that could operate on-premises or in any cloud.
Syncsort Connect for Big Data is specifically designed to simplify the process of accessing, integrating, governing and securing all your enterprise data – batch and streaming – in a single software environment. With Connect for Big Data you can:
• Visually design your jobs once, and deploy them anywhere – MapReduce, Spark, Linux, Unix, Windows – on premise or in the cloud. No changes or tuning required.
• Easily move applications from standalone server environments and from MapReduce to Spark – as easy as clicking on a drop-down menu.
• Future-proof job designs for emerging compute frameworks.
• Avoid tuning – Intelligent Execution dynamically plans applications at run-time based on the chosen compute framework.
• Insulate your users from the underlying complexities of Hadoop and use existing ETL skills.
• Cut development time in half.