This document provides an overview of a presentation on advanced analytics, big data, and being a data scientist. The presentation agenda includes an introduction to data science, why the presenter became a data scientist, definitions of data science, data science skillsets, the data science process for one-off projects versus production pipelines, various data science tools, and a question and answer section. The document outlines each section in detail with examples.
This document provides an overview of data science including:
- Definitions of data science and the motivations for its increasing importance due to factors like big data, cloud computing, and the internet of things.
- The key skills required of data scientists and an overview of the data science process.
- Descriptions of different types of databases like relational, NoSQL, and data warehouses versus data lakes.
- An introduction to machine learning, data mining, and data visualization.
- Details on courses for learning data science.
This document provides an introduction to data science. It discusses why data science is important and covers key techniques like statistics, data mining, and visualization. It also reviews popular tools and platforms for data science like R, Hadoop, and real-time systems. Finally, it discusses how data science can be applied across different business domains such as financial services, telecom, retail, and healthcare.
Big Data [sorry] & Data Science: What Does a Data Scientist Do? - Data Science London
What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? - talk by Carlos Somohano @ds_ldn at The Cloud and Big Data: HDInsight on Azure London 25/01/13
This document provides an introduction to data science and analytics. It discusses why data science jobs are in high demand, what skills are needed for these roles, and common types of analytics including descriptive, predictive, and prescriptive. It also covers topics like machine learning, big data, structured vs unstructured data, and examples of companies that utilize data and analytics like Amazon and Facebook. The document is intended to explain key concepts in data science and why attending a talk on this topic would be beneficial.
Data Scientist has been regarded as the sexiest job of the twenty-first century. As data in every industry keeps growing, the need to organize, explore, analyze, predict and summarize is insatiable. Data Science is creating new paradigms in data-driven business decisions. As the field emerges out of its infancy, a wide range of skill sets are becoming an integral part of being a Data Scientist. In this talk I will discuss the different data-driven roles and the expertise required to be successful in them. I will highlight some of the unique challenges and rewards of working in a young and dynamic field.
Session 01 designing and scoping a data science project - bodaceacat
This document provides an overview of the first session in a data science training series. It discusses designing and scoping a data science project. Key points include: defining data science and the data science process; describing the roles of problem owners and competitors; reviewing examples of data science competitions from Kaggle, DrivenData, and DataKind; and providing guidance on writing an effective problem statement by specifying the context, needs, vision, and intended outcomes of a project. The document also briefly covers data science ethics considerations like ensuring privacy and minimizing risks. Exercises are included to help participants practice asking interesting questions, identifying relevant data sources, and designing communications for target audiences.
Introduction to Data Science (Data Summit, 2017) - Caserta
This document summarizes an introduction to data science presentation by Joe Caserta and Bill Walrond of Caserta Concepts. Caserta Concepts is an internationally recognized data innovation and engineering consulting firm. The agenda covers why data science is important, challenges of working with big data, governing big data, the data pyramid, what data scientists do, standards for data science, and a demonstration of data analysis. Popular machine learning algorithms like regression, decision trees, k-means clustering and collaborative filtering are also discussed.
Data science vs. Data scientist by Jothi Periasamy - Peter Kua
This document discusses data science vs data scientists and outlines key competencies for data scientists. It defines data science as modernizing existing analytics and data solutions using new data sources, formats, architectures, and techniques. The document compares traditional and modern approaches to data and analytics. It also discusses the skills required of entry-level vs senior data scientists, noting that enterprise data scientists require strong industry and business process skills while focusing on data, analytics, communication and technical abilities. The document provides an overview of the roles, responsibilities and deliverables of data scientists on enterprise projects.
A brief introduction to data science and machine learning, with an emphasis on application scenarios, from the traditional to the most innovative. The overview covers the basic definition of data science, an overview of machine learning, and examples of traditional scenarios, recommender systems and social network analysis, IoT, and deep learning.
Presentation at Data ScienceTech Institute campuses, Paris and Nice, May 2016, including Intro, Data Science History and Terms; 10 Real-World Data Science Lessons; Data Science Now: Polls & Trends; Data Science Roles; Data Science Job Trends; and Data Science Future
A Practical-ish Introduction to Data Science - Mark West
In this talk I will share insights and knowledge that I have gained from building up a Data Science department from scratch. This talk will be split into three sections:
1. I'll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organisation.
2. Next up we'll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples of use cases where these algorithms can be applied.
3. The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
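As a hedged illustration of that third section, here is a minimal scikit-learn getting-started sketch; the dataset and classifier choice are assumptions for illustration and are not taken from the talk's own demo.

```python
# Minimal scikit-learn sketch: train and evaluate a decision tree classifier.
# The iris dataset and DecisionTreeClassifier are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```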
The document outlines the typical lifecycle of a data science project, including business requirements, data acquisition, data preparation, hypothesis and modeling, evaluation and interpretation, and deployment. It discusses collecting data from various sources, cleaning and integrating data in the preparation stage, selecting and engineering features, building and validating models, and ultimately deploying results.
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science - Mark West
This document provides an introduction to data science. It begins with defining data science and its interdisciplinary nature, drawing from fields like computer science, mathematics, statistics, and domain-specific knowledge. It then discusses machine learning as a tool in data science and provides examples of common machine learning algorithms like linear regression, decision trees, and k-means clustering. It also outlines different roles required for data science projects. The document aims to give a practical overview of key concepts in data science.
This document outlines a data science enablement roadmap created by the Advanced Center of Excellence at Modern Renaissance Corporation. The roadmap consists of 1 introductory course and 3 advanced courses that can earn a student a master's level certificate in data science. The introductory course provides a broad overview of topics like algorithms, statistics, machine learning, and big data platforms. The advanced courses focus on specific skills like machine learning with R, modern data platforms using Hadoop, and advanced big data analytics techniques. The goal is to give students a versatile, practical skill set for a career in data science or big data engineering.
This document provides an overview of the data science process and tools for a data science project. It discusses identifying important business questions to answer with data, extracting relevant data from sources, cleaning and sampling the data, analyzing samples to create models and check hypotheses, applying results to full data sets, visualizing findings, automating and deploying solutions, and continuously learning and improving through an iterative process. Key tools mentioned include Hadoop, R, Python, Excel, and various data wrangling, analysis, and visualization tools.
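A minimal sketch of the clean-and-sample steps described above, using pandas; the inline records and column names are hypothetical stand-ins for data extracted from a real source.

```python
# Hypothetical extract / clean / sample step with pandas. The inline records
# stand in for data pulled from a real source (database, CSV, API).
import pandas as pd

raw = pd.DataFrame([
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 20.0},    # duplicate row
    {"customer_id": 2, "amount": None},    # missing value
    {"customer_id": 3, "amount": 75.5},
    {"customer_id": 4, "amount": 12.0},
])

# Clean: drop exact duplicates and rows missing the fields we model on.
clean = raw.drop_duplicates().dropna(subset=["amount"])

# Sample: analyze a subset first, then apply results to the full data set.
sample = clean.sample(frac=0.5, random_state=7)
print(len(raw), "raw rows ->", len(clean), "clean rows ->", len(sample), "sampled rows")
```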
This document discusses the rise of big data and data science. It notes that while data volumes are growing exponentially, data alone is just an asset - it is data scientists that create value by building data products that provide insights. The document outlines the data science workflow and highlights both the tools used and challenges faced by data scientists in extracting value from big data.
An introduction to data science: from the very beginning of the data science idea to the latest designs, changing trends, and the technologies behind them, through to applications that are already in real-world use today.
Big Data and Data Science: The Technologies Shaping Our Lives - Rukshan Batuwita
Big Data and Data Science have become increasingly important areas in both industry and academia, to the extent that every company wants to hire a Data Scientist and every university wants to start dedicated degree programs and centres of excellence in Data Science. Big Data and Data Science have led to technologies that have already shaped different aspects of our lives such as learning, working, travelling, purchasing, social relationships, entertainment, physical activities, and medical treatments. This talk will attempt to cover the landscape of some of the important topics in these exponentially growing areas of Data Science and Big Data, including the state-of-the-art processes, commercial and open-source platforms, data processing and analytics algorithms (especially large-scale Machine Learning), application areas in academia and industry, the best industry practices, business challenges, and what it takes to become a Data Scientist.
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne... - Edureka!
These Edureka Data Science course slides will take you through the basics of Data Science - why Data Science, what is Data Science, use cases, BI vs Data Science, Data Science tools and the Data Science lifecycle process. This is ideal for beginners to get started with learning data science.
You can read the blog here: https://goo.gl/OoDCxz
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
Here are some key points to consider when designing visuals:
- Who is your audience? What information do they need?
- What insights or messages do you want to convey?
- Consider different visualisation types and choose those best suited to your data and goals
- Use visual hierarchy, layout and formatting to guide the eye and message
- Iteratively sketch, test and refine your designs with your intended users
- Balance simplicity and clarity with including all necessary information
The design process is iterative. Start broadly and refine based on testing with intended users. Focus on conveying the most important insights as simply as possible.
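As one small, hedged example of "conveying the most important insights as simply as possible": a chart can state its message in the title, order the data so the eye lands on the key comparison, and strip non-essential decoration. The data and wording below are invented for illustration.

```python
# Illustrative only: a bar chart whose title states the insight directly,
# sorted so the largest category is read first. Data is made up.
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
revenue = [42, 71, 55, 63]

pairs = sorted(zip(revenue, regions), reverse=True)
values, labels = zip(*pairs)

fig, ax = plt.subplots()
ax.barh(labels, values)
ax.invert_yaxis()  # largest bar on top
ax.set_title("South leads revenue; North trails by a wide margin")
ax.set_xlabel("Revenue (EUR thousands)")
for side in ("top", "right"):          # remove chart junk
    ax.spines[side].set_visible(False)
plt.tight_layout()
plt.show()
```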
The document describes a 10 module data science course covering topics such as introduction to data science, machine learning techniques using R, Hadoop architecture, and Mahout algorithms. The course includes live online classes, recorded lectures, quizzes, projects, and a certificate. Each module covers specific data science topics and techniques. The document provides details on the course content, objectives, and topics covered in module 1 which includes an introduction to data science, its components, use cases, and how to integrate R and Hadoop. Examples of data science applications in various domains like healthcare, retail, and social media are also presented.
This document discusses the need for a new paradigm in big data analytics using algorithms. It begins by describing the limitations of traditional analytics approaches like statistical analysis, data mining, visualization and business intelligence tools when applied to big data. These approaches are query-based and labor intensive. Emerging big data tools like Hadoop and in-memory databases help with storage and queries but do not provide automated insights. The document argues that the new paradigm should focus on algorithms that can automatically surface insights from data in seconds, replacing the need for data analysts to manually query databases. This represents a shift from humans digging for insights to algorithms surfacing insights for humans to evaluate.
My presentation on Data Mining, Lessons from Competitions, and Public Data looks at the Data Mining/Data Science/Big Data evolution, reviews lessons from KDD Cup 1997, Netflix Prize, and Kaggle, presents a big list of Public and Government data APIs, Marketplaces, Portals, and Platforms, and examines Big Data Hype. This talk was given at BPDM-2013, (Broadening Participation in Data Mining), Aug 10, 2013 held at KDD-2013, Chicago.
This document provides an overview of data science and its applications. It discusses:
1) Industries that are being disrupted by data science like telecom, banking, retail, and healthcare.
2) How companies like Amazon, Netflix, and Google were able to disrupt their industries through their ability to analyze patterns in data faster than competitors.
3) The factors driving more companies to adopt data science including competitive advantages, revenue growth, and cost optimization.
A look back at how the practice of data science has evolved over the years, modern trends, and where it might be headed in the future. Starting from before anyone had the title "data scientist" on their resume, to the dawn of the cloud and big data, and the new tools and companies trying to push the state of the art forward. Finally, some wild speculation on where data science might be headed.
Presentation given to Seattle Data Science Meetup on Friday July 24th 2015.
Introduction to Big Data and its Trends - Jongwook Woo
Big Data has been popular for the last 10 years, using Hadoop and Spark for data analysis and prediction on large-scale data sets in distributed parallel computing systems. Its platform has expanded to include NoSQL databases and search engines, and has grown more popular alongside cloud computing. Deep Learning has become a buzzword over the past several years, built on GPUs and Big Data; it lets even small companies and labs own supercomputers on a small budget, a "dream come true" for IT and business. In this talk, the history and trends of Big Data and AI platforms are introduced and Big Data predictive analysis is presented.
This presentation is prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, or Data Science, join our free Introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
This document provides an overview of data science including what is big data and data science, applications of data science, and system infrastructure. It then discusses recommendation systems in more detail, describing them as systems that predict user preferences for items. A case study on recommendation systems follows, outlining collaborative filtering and content-based recommendation algorithms, and diving deeper into collaborative filtering approaches of user-based and item-based filtering. Challenges with collaborative filtering are also noted.
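A hedged sketch of the item-based collaborative filtering idea mentioned above: score an unseen item for a user by averaging the user's existing ratings, weighted by cosine similarity between item rating vectors. The tiny ratings matrix is invented for illustration.

```python
# Item-based collaborative filtering sketch with cosine similarity.
# The ratings matrix is a toy example; 0 means "not rated".
import numpy as np

ratings = np.array([            # rows = users, columns = items
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

n_items = ratings.shape[1]
item_sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                      for j in range(n_items)] for i in range(n_items)])

def predict(user, item):
    """Weighted average of the user's ratings, weighted by item similarity."""
    rated = ratings[user] > 0
    weights = item_sim[item][rated]
    if weights.sum() == 0:
        return 0.0
    return ratings[user][rated] @ weights / weights.sum()

print("Predicted rating of item 2 for user 0:", round(predict(0, 2), 2))
```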
The document summarizes key trends from the 2015 Internet Trends report by Mary Meeker. It outlines that while global internet and smartphone user growth is still solid, the growth rate is slowing as adoption increases. It also notes that incremental users will be harder to obtain as adoption depends more on developing markets. Internet usage and engagement growth remains strong, especially for mobile video. Mobile advertising is growing faster than desktop but still lags in share of total internet advertising spending. The document also highlights new advertising formats and payment options optimized for mobile usage as well as the rise of vertical video viewing. Finally, it discusses how enterprise technology startups are reimagining business processes by addressing prior pain points in areas like communications, payments, analytics and
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14... - Lucas Jellema
This presentation gives a brief overview of the history of relational databases, ACID and SQL, and presents some of their key strengths and potential weaknesses. It introduces the rise of NoSQL - why it arose, what it entails, when to use it. The presentation focuses on MongoDB as a prime example of a NoSQL document store and shows how to interact with MongoDB from JavaScript (NodeJS) and Java.
Our data should be secure, and our environment too. What can we do to maximize security in a hybrid environment, where SQL Server exists in two forms, on-premises and cloud? How do we organize our work and control our data if we use Windows Azure SQL Database, the cloud database? Physical security, policy-based management, auditing, encryption, federation, access and authorization: all of those subjects will be covered during my session.
MongoDB NoSQL database a deep dive - MyWhitePaper - Rajesh Kumar
This document provides an overview of MongoDB, a popular NoSQL database. It discusses why NoSQL databases were created, the different types of NoSQL databases, and focuses on MongoDB. MongoDB is a document-oriented database that stores data in JSON-like documents with dynamic schemas. It provides horizontal scaling, high performance, and flexible data models. The presentation covers MongoDB concepts like databases, collections, documents, CRUD operations, indexing, sharding, replication, and use cases. It provides examples of modeling data in MongoDB and considerations for data and schema design.
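The CRUD operations summarized above can be sketched with Python's pymongo driver; the connection URI, database name, and document fields are assumptions, and a local MongoDB instance is presumed to be running.

```python
# Minimal pymongo CRUD sketch. Connection URI, database name, and document
# shape are illustrative assumptions; requires a running MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]
products = db["products"]

# Create
products.insert_one({"name": "notebook", "price": 3.50, "tags": ["paper", "office"]})

# Read
doc = products.find_one({"name": "notebook"})
print("Found:", doc)

# Update
products.update_one({"name": "notebook"}, {"$set": {"price": 2.99}})

# Index a frequently filtered field to speed up queries
products.create_index("name")

# Delete
products.delete_one({"name": "notebook"})

client.close()
```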
The document discusses various options for funding an independent film, including competitions, crowd funding, bank loans, funding from organizations like the British Film Institute, deferrals, and conclusions. It analyzes the advantages and disadvantages of each approach. The author concludes that crowd funding would be the most suitable option for their film, as it allows gathering an audience and investment from interested supporters while maintaining creative control, though advertising costs must also be considered.
2017 iosco research report on financial technologies (fintech) - Ian Beckett
This document provides an overview of financial technologies (Fintech) and their intersection with securities markets regulation. It examines alternative financing platforms, retail trading/investment platforms, institutional trading platforms, and distributed ledger technologies. The report finds that Fintech is transforming traditional financial services through new business models and technologies. This raises regulatory questions around benefits/risks and implications for investor protection, market integrity, and stability. The document incorporates survey responses from global regulators on their experiences with Fintech.
Cloud computing gives you a number of advantages, such as the ability to scale your web application or website on demand. If you have a new web application and want to use cloud computing, you might be asking yourself, "Where do I start?" Join us in this session to understand best practices for scaling your resources from zero to millions of users. We show you how to best combine different AWS services, how to make smarter decisions for architecting your application, and how to scale your infrastructure in the cloud.
This document summarizes a legal research paper about regulating corporate venture capital (CVC). It finds that CVC has grown dramatically since 2008 and now plays an important role in startup financing and the rise of "unicorns" (private companies valued over $1 billion). However, CVC faces little regulation. The paper aims to address this by analyzing the legal implications of CVC in two areas: securities regulation and conflicts of interest. It examines case studies of several prominent CVC firms like GV and Intel Capital to understand current disclosure practices and argues more transparency is needed given CVC's influence on private markets and company boards.
The report offers marketers 25 open, click-through, list churn and mobile metrics to help you see where you rank, delivers more visuals so you can better understand the data, and shares more observations to help you improve your marketing programs.
This document discusses databases, including the definition of a database, types of databases, and the differences between relational and non-relational (NoSQL) databases. A database is described as a collection of information stored systematically so that information can be retrieved, while a relational database stores data in interrelated tables and NoSQL simplifies the database process by eliminating data redundancy.
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION - Elvis Muyanja
Today, data science is enabling companies, governments, research centres and other organisations to turn their volumes of big data into valuable and actionable insights. It is important to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. According to the McKinsey Global Institute, the U.S. alone could face a shortage of about 190,000 data scientists and 1.5 million managers and analysts who can understand and make decisions using big data by 2018. In coming years, data scientists will be vital to all sectors —from law and medicine to media and nonprofits. Has the African continent planned to train the next generation of data scientists required on the continent?
Europa AI startup scaleups report 2016 - Ian Beckett
- €1.8 billion was invested across 306 deals in 22 European countries in 2016 for artificial intelligence and data analytics startups. The UK received the most funding and had the most deals.
- Common business models included content-driven platforms and marketplaces. Most companies pursued B2B models.
- Advertising/marketing was the top industry for investment, followed by fintech and business intelligence. London, Paris, and Berlin were leading cities.
Meetup sthlm - introduction to Machine Learning with demo cases - Zenodia Charpy
This document provides an agenda and overview of topics related to data science and machine learning. It discusses data science processes including data preparation, algorithm selection, model deployment, and performance measurement. It also distinguishes machine learning from artificial intelligence and describes common machine learning approaches such as supervised and unsupervised learning. Examples of supervised and unsupervised learning applications are presented along with generic workflows. Machine learning algorithm selection and example cases are also summarized.
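As a small, hedged example of the unsupervised side mentioned above, here is a k-means clustering sketch with scikit-learn; the synthetic blob data is generated purely for illustration.

```python
# Unsupervised learning sketch: k-means clustering on synthetic 2-D data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centres:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", labels[:10])
```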
The document provides a description of data scientist positions at three levels - Data Scientist I, II, and III. It outlines the general characteristics and responsibilities expected for each level, with level III involving the most complex work, responsibilities for leading projects, and experience/education qualifications. Key responsibilities include data analysis, modeling, collaborating with stakeholders, and communicating results.
Data Science - An emerging Stream of Science with its Spreading Reach & Impact - Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An emerging Stream of Science with its Spreading Reach & Impact". I have compiled and collected different statistics and data from different sources. This may be useful for students and those who might be interested in this field of study.
TechWise with Eric Kavanagh, Dr. Robin Bloor and Dr. Kirk Borne
Live Webcast on July 23, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=59d50a520542ee7ed00a0c38e8319b54
Analytical applications are everywhere these days, and for good reason. Organizations large and small are using analytics to better understand any aspect of their business: customers, processes, behaviors, even competitors. There are several critical success factors for using analytics effectively: 1) know which kind of apps make sense for your company; 2) figure out which data sets you can use, both internal and external; 3) determine optimal roles and responsibilities for your team; 4) identify where you need help, either by hiring new employees or using consultants; 5) manage your program effectively over time.
Register for this episode of TechWise to learn from two of the most experienced analysts in the business: Dr. Robin Bloor, Chief Analyst of The Bloor Group, and Dr. Kirk Borne, Data Scientist, George Mason University. Each will provide their perspective on how companies can address each of the key success factors in building, refining and using analytics to improve their business. There will then be an extensive Q&A session in which attendees can ask detailed questions of our experts and get answers in real time. Registrants will also receive a consolidated deck of slides, not just from the main presenters, but also from a variety of software vendors who provide targeted solutions.
Visit InsideAnalysis.com for more information.
3. Josep Curto - Turning your organization into a Big Data company - antishmanti
The document discusses transforming an organization into a Big Data company. It outlines the challenges of digital disruption and how companies like Amazon, Apple, Google and Netflix understand customers through their digital footprints. It then discusses six challenges of Big Data including data capture, storage, analysis, visualization, IT dependence, and creating a new culture. The remainder of the document focuses on business models for Big Data and implementing Big Data strategies and projects within an organization.
The document outlines an agenda for a presentation on big data. It discusses key topics like the state of big data adoption, a holistic approach to big data, five high value use cases, technical components, and the future of big data and cloud. The presentation aims to provide an overview of big data and how organizations can take a comprehensive approach to leveraging their data assets.
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3uqcAN0
Self-service is a major goal of modern data strategists. A successfully implemented self-service initiative means that business users have access to holistic and consistent views of data regardless of its location, source or type. As data unification and data collaboration become key critical success factors for organizations, data catalogs play a key role as the perfect companion for a virtual layer to fully empower those self-service initiatives and build a self-service data marketplace requiring minimal IT intervention.
Denodo’s Data Catalog is a key piece in Denodo’s portfolio to bridge the gap between the technical data infrastructure and business users. It provides documentation, search, governance and collaboration capabilities, and data exploration wizards. It provides business users with the tool to generate their own insights with proper security, governance, and guardrails.
In this session we will cover:
- The role of a virtual semantic layer in self-service initiatives
- Key ingredients of a successful self-service data marketplace
- Self-service (consumption) vs. inventory catalogs
- Best practices and advanced tips for successful deployment
- A Demonstration: Product Demo
- Examples of customers using Denodo’s Data Catalog to enable self-service initiatives
Data mining involves extracting useful patterns from large amounts of data. It involves defining a problem, preparing data, exploring data, building models, and deploying models. Some common applications of data mining include analyzing customer purchasing patterns, detecting fraud, predicting disease outbreaks, and analyzing financial/business data. While data warehousing provides insights into past trends, data mining can discover hidden patterns to predict future trends and behaviors from data.
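One hedged sketch of the "detecting fraud" use case above: unsupervised anomaly detection with scikit-learn's IsolationForest. The transaction amounts are synthetic; a real project would use richer features.

```python
# Anomaly detection sketch for a fraud-like use case with IsolationForest.
# The transaction data is synthetic and one-dimensional for clarity.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=50, scale=10, size=(500, 1))      # typical amounts
suspicious = rng.normal(loc=500, scale=50, size=(5, 1))   # unusually large amounts
amounts = np.vstack([normal, suspicious])

detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(amounts)   # -1 = anomaly, 1 = normal

print("Flagged transaction amounts:", amounts[flags == -1].ravel().round(1))
```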
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
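A hedged sketch of the model building and validation steps with the libraries named above; the built-in dataset and model choice are illustrative rather than taken from the document.

```python
# Model building and validation sketch: a scikit-learn Pipeline scored with
# k-fold cross-validation. Dataset and model choice are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```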
This talk is an introduction to Data Science. It explains Data Science from two perspectives - as a profession and as a discipline. While covering the benefits of Data Science for business, it explains how to get started with embracing data science in business.
Abstract:
Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.
This document summarizes an introductory presentation on data science. It introduces the presenter and their background in data and analytics. The goals of the presentation are to define what a data scientist is, how the field has emerged, and how to become one. It discusses the growing demand and salaries for data scientists. Examples are given of how data science has been applied at companies like LinkedIn and Netflix. The presentation covers big data, Hadoop, data processing techniques, machine learning algorithms, and tools used in data science. Finally, attendees are encouraged to consider Thinkful's data science bootcamp program.
This document discusses big data and data science. It addresses three main points:
1) Big data methods and algorithms can be useful for smaller datasets as well as large ones.
2) Successfully extracting insights from data requires a team with a variety of skills, including business and domain knowledge.
3) For HR in particular, big data can help determine the optimal time to approach potential candidates by analyzing patterns in their job seeking activities online.
Tips and Tricks to be an Effective Data Scientist - Lisa Cohen
Data Science is an evolving field that requires a diverse skill set. From analytical techniques to career advice, this talk is full of practical tips that you can apply immediately to your job.
Data Science and AI in Biomedicine: The World has Changed - Philip Bourne
This document discusses the changing landscape of data science and AI in biomedicine. Some key points:
- We are at a tipping point where data science is becoming a driver of biomedical research rather than just a tool. Biomedical researchers need to become data scientists.
- Data science is interdisciplinary and touches every field due to the rise of digital data. It requires openness, translation of findings, and consideration of responsibilities like algorithmic bias.
- Advances like AlphaFold2 show the power of large collaborative efforts combining data, computing resources, engineering, and domain expertise. This points to the need for public-private partnerships and new models of open data sharing.
- The definition of
DAS Slides: Graph Databases — Practical Use Cases - DATAVERSITY
Graph databases are seeing a spike in popularity as their value in leveraging large data sets for key areas such as fraud detection, marketing, and network optimization become increasingly apparent. With graph databases, it’s been said that ‘the data model and the metadata are the database’. What does this mean in a practical application, and how can this technology be optimized for maximum business value?
This document provides an introduction to data mining and data warehousing. It discusses how the volume of data being collected is growing exponentially in many fields due to advances in data collection technologies. It also describes how data mining can be used to extract useful knowledge and patterns from large datasets to help solve important problems. The document outlines some key techniques in data mining including classification, clustering, and association rule mining. It discusses how data mining draws from fields like machine learning, statistics, and databases to analyze large and complex datasets.
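A hedged, from-scratch sketch of the association rule mining technique listed above: compute support and confidence for simple one-to-one rules over toy transactions (a real project would typically use a library such as mlxtend; the data and thresholds here are invented).

```python
# Tiny association-rule sketch using only the standard library: support and
# confidence for single-item rules over toy market-basket transactions.
from itertools import permutations

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs", "butter"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

items = sorted({i for t in transactions for i in t})
for a, b in permutations(items, 2):
    supp = support({a, b})
    conf = supp / support({a}) if support({a}) else 0.0
    if supp >= 0.4 and conf >= 0.7:
        print(f"{a} -> {b}: support={supp:.2f}, confidence={conf:.2f}")
```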
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We... - Sri Ambati
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/xc3j20Om3UM
Description:
Data science is indeed one of the sexy jobs of the 21st century. But it is also a lot of hard work. And the hard work is seldom about the math or the algorithms. It is about building relevant machine learning products for the real world. We will go over some of the must-haves as you take your machine learning model out of the sandbox and make it work in the big, bad world outside.
Speaker's Bio:
Krish Swamy is an experienced professional with deep skills in applying analytics and BigData capabilities to challenging business problems and driving customer insights. Krish's analytic experience includes marketing and pricing, credit risk, digital analytics and most recently, big data analytics and data transformation. His key experiences lie in banking and financial services, the digital customer experience domain, with a background in management consulting. Other key skills include influencing organizational change towards a data and analytics driven culture, and building teams of analysts, statisticians and data scientists.
Here is the analysis using Bonferroni's principle:
- Assume equal probability of each color selling (1/6 probability per sale)
- Calculate the probability of observing X or fewer sales for each color
- If the probability is very low (e.g. p < 0.05), then we can say that color is "not selling"
This approach avoids falsely claiming a color is not selling, when the low sales could just be due to chance in a small sample. It provides a rigorous statistical test to help make the decision.
Bonferroni's Principle; Task Example
Red: 2 sales
P(X ≤ 2) = Σ_{k=0..2} C(n, k) (1/6)^k (5/6)^(n-k), where n is the total number of sales observed
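A hedged worked example of the calculation above using scipy's binomial CDF; the total sales count n = 30 is an assumption made only for illustration, since n is not given in the original example.

```python
# Bonferroni-style check: is "Red: 2 sales" surprisingly low, or just chance?
# Assumes each of 6 colors is equally likely per sale (p = 1/6) and a
# hypothetical total of n = 30 sales; n is NOT given in the original example.
from scipy.stats import binom

n, p = 30, 1 / 6
observed = 2  # red's sales

p_value = binom.cdf(observed, n, p)   # P(X <= 2)
print("P(X <= %d) = %.4f" % (observed, p_value))
if p_value < 0.05:
    print("Red's low sales are unlikely to be chance alone.")
else:
    print("Too early to say red is 'not selling'; it could just be chance.")
```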
1) Jordan Engbers is a chief scientist and CTO who has experience in bioinformatics, neuroscience, clinical data science, and founding two data science companies.
2) Data science is a multidisciplinary field that uses techniques from many areas like statistics, computer science, and domain knowledge to understand data and help improve decision making.
3) The impact of data science comes from developing data products - tools that deliver insights from data to drive better decisions. This requires both scientific rigor and software engineering practices.
Introduction to Data Analytics and data analytics life cycle - Dr. Radhey Shyam
The document provides an overview of data analytics and big data concepts. It discusses the characteristics of big data, including the four V's of volume, velocity, variety and veracity. It also describes different types of data like structured, semi-structured and unstructured data. The document then introduces big data platforms and tools like Hadoop, Spark and Cassandra. Finally, it discusses the need for data analytics in business, including enabling better decision making and improving efficiency.
DeepLearning Experiments in Medical Image show case Zenodia Charpy
1. The document discusses assessing and selecting a successful computer vision proof-of-concept (POC) project by defining the problem, ensuring properly annotated data, and evaluating the solvability and value of the problem.
2. It explains the different types of data, labels, and annotations needed for classification, segmentation, object detection, and other computer vision models.
3. The complexity and time required to develop each model is considered, with simpler problems having the shortest timelines.
4. Examples of computer vision models and their inputs and outputs are provided for classification, segmentation, object detection, and image translation tasks.
how to build a Length of Stay model for a ProofOfConcept project - Zenodia Charpy
A walk-through, end to end and in detail, of how a machine learning process works on a healthcare-related model (here I picked the LengthOfStay problem) as a touch point to start the discussion; the scope is set to POC.
This document discusses using neural networks to build models for classifying pneumonia, performing video semantic segmentation, object detection in video, segmenting organs in CT scans, and learning to play games. It describes getting data, preprocessing, choosing a framework, designing the neural network architecture, training and validating the model, and making predictions in a Jupyter notebook demo. The concepts can apply to other scenarios involving images, video, or other input data.
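A hedged, minimal Keras sketch of the design, train, validate, predict loop described above; the random arrays stand in for real image data (such as pneumonia X-rays) and the tiny architecture is purely illustrative.

```python
# Minimal Keras sketch of the train/validate/predict loop. Random arrays
# stand in for real image data; the tiny architecture is illustrative only.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x_train = np.random.rand(200, 64, 64, 1).astype("float32")  # fake grayscale images
y_train = np.random.randint(0, 2, size=(200,))               # fake binary labels

model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(x_train, y_train, epochs=2, batch_size=32, validation_split=0.2)
predictions = model.predict(x_train[:5])
print("Example predicted probabilities:", predictions.ravel().round(3))
```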
This document outlines several case-based scenarios for demonstrating data science activities using Azure services. Six cases are described:
1) A playground for citizen data scientists to gain an end-to-end understanding of the data science process using a simple UI.
2) Using SQL databases and services for machine learning tasks when all data resides in SQL.
3) Parallel training of models on multiple datasets to automate and scale the training process.
4) Using GPU-enabled environments for training deep learning models requiring GPU acceleration.
5) Leveraging high-speed data processing services when working with large datasets over 1GB.
6) A basic sandbox environment for data scientists, engineers, and analysts providing pre-
Zenodia TechDays talks Oct 24-25 Stockholm Kistamässan - Zenodia Charpy
TechDays Stockholm presentation:
- understand how AI learns
- train an AI model to identify pneumonia
- deploy an AI model on Azure as a web service with Databricks
- consume the model
- generalize the model to other scenarios (flappy bird, CT scan segmentation, etc.)
A demo on your own dataset (csv, dicom, image, etc.) for each service: how to apply data science in practice with the various Azure machine learning services, and when each service should be used, for which scenarios and datasets. Demoed Azure services include:
- Azure T-SQL in-database analytics
- Azure Batch service for parallel model training on multiple datasets
- Azure Batch AI service for deep learning models with GPU acceleration
- Azure Databricks for deep learning + OpenCV (computer vision tasks) + sklearn (classical machine learning models)
- Azure Data Science Virtual Machine - a sandbox and shared environment for data science experiments
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
2. 1. Introduction to data science – where did it come from
2. Why did I become a data scientist ?
3. Definition of data science
4. Data science skillset map
5. Data science process – one-off vs. production pipeline
6. Data science process breakdown – a bit more detail
7. Various Data Science tools
8. Q&A
Agenda of today
4. Google Trends – what people are searching for
[Trend chart comparing four search terms: (1) cloud computing, (2) virtualization, (3) big data, (4) data science]
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
7. What people are searching for – top 5 keywords
[Trend chart legend: Cloud computing, Virtualization, Data Science, Big Data]
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
8. Examples of what makes the data so big
Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
9. Data Science can help to reveal these insights
Data value from the business's perspective
13. WHY?
As an analyst for many years…
I realised …
14. Insight to action – too slow!
[Diagram: marketers request insights; the analysts monitor, analyse and answer, usually in a dashboard/report format; acting on the customer takes weekly, monthly and even +6-month cycles]
Issues discovered
1. Data is not centralized / synchronized
2. Data quality is bad
3. The organization's hierarchy slows down the decision-making process
4. NO common KPIs (isolated measurement)
5. Marketing strategy strongly depends on gut feelings (historical reasons)
6. Knowledge gaps & misconceptions (focus on visualization, not necessarily facts)
7. Insufficient information (insufficient data sources to answer the given question)
15. How did it happen?
Fragmented data view
1. Focus on the database as the only truth
2. Limited data sources (mostly DB + clickstreams)
3. A central data repository did not exist
4. A common definition of a customer did not exist
5. Customers' ever-changing behavior (historical vs real-time behavioural data)
6. Marketers' beliefs vs. real evidence about the customers
18. Data Science can at least answer SOME of those concerns!
But . . .
it heavily depends on how mature the organization is
19. Organization maturity vs. data maturity
Organization maturity stages: resistance to change → isolated acceptance → growing importance → embracing throughout business disciplines → data-driven product & organization
Data maturity stages: fragmented data (ad-hoc reports focused) → central data lake (exploratory analysis) → 360 data view in real time (predictive analytics) → data governance (data quality control) → data-driven enterprise strategy (recommender systems)
Source : https://datafloq.com/read/five-levels-big-data-maturity-organisation/259
21. Data science is a "concept to unify statistics, data analysis and their
related methods" in order to "understand and analyze actual phenomena"
with data. It employs techniques and theories drawn from many fields
within the broad areas of mathematics, statistics, information science,
and computer science, in particular from the subdomains of machine
learning, classification, cluster analysis, data mining, databases,
and visualization.
Short definition (wikipedia)
22. Data Science synonyms … what includes what
Machine Learning (ML) – typical characteristics:
• Is question specific
• Bias-variance trade-off + over-/under-fitting (see the small sketch after this slide)
• Split data into training and testing (validation) sets
• Can be combined with other algorithms
• Can utilize parallelization
• Deals with all kinds of data (incl. unstructured)
• Data mining techniques (for big data) are applied
Predictive analytics (Supervised Learning) – typical characteristics:
• Focus on feature engineering (variable selection)
• Exploration vs. exploitation
• Prediction performance decays quickly over time
• Mostly ad-hoc | one-off based
• Deals with all kinds of data (when applying machine learning), or else mostly structured | semi-structured data
Inferential + Exploratory + Descriptive – typical characteristics:
• Ad-hoc based
• Limited data blending
• Mostly structured data (from a database)
• Focus on historical statistical models
• Modelling focuses on finding correlations or describing existing datasets
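The bias-variance point above can be made concrete with a tiny experiment. This is my own minimal sketch in Python/scikit-learn (not taken from the slides): as a decision tree is allowed to grow deeper, training accuracy keeps climbing while test accuracy flattens or drops, which is the over-fitting side of the trade-off.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Deeper trees: training accuracy rises, test accuracy stops improving (over-fitting).
for depth in (1, 2, 4, 8, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          round(model.score(X_train, y_train), 2),
          round(model.score(X_test, y_test), 2))
```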
26. Data Scientist – the skillset map
Unicorn version vs. your own path!
27. Not on the map but equally important
Teamwork essentials –
• Story-telling
• Visualization
• Cooperation / team building
• Inter-personal skills / inspiration coach
• Open mind
• Knowledge sharing
Personality traits –
• Extreme curiosity
• Detective spirit
• Naive and stupid
• Strong ethics (data protection / privacy law)
28. My journey – my own version (a tree metaphor)
Roots – your initial foundation:
• Math (university)
• Statistics (university)
• Computer Science (master's)
• Analyst (work experience)
The ground – the data science threshold
Tree trunk – skillsets yet to be acquired:
• Programming: R & Python
• Machine learning algorithms
• Data mining techniques
• Cloud services (virtualization concepts)
• Big data ecosystems
• Bayesian statistics
• Graph theory (optional)
• Text mining techniques (optional)
Tree branches & leaves – specialized interests / areas of further development:
• Leadership / team building
• Recommender systems
• Experimental design
• Game theory
• Story-telling / presentation skills
• New model development
• Deep learning / artificial intelligence
Motivation is the key!
31. What motivates you?
What would your path look like?
(15 mins break)
32. Refresh our memory from the previous section –
• The relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related synonyms
• The skillset map for becoming a data scientist (unicorn version vs. your own)
• Motivation is the key!
41. Where did these two approaches come from?
Due to organization maturity . . .
42. Organization maturity – from traditional BI to a data-driven organization & products
Organization maturity phases:
• Phase 1 (Infancy): traditional BI; data silos – fragmented data views; resistance to change
• Phase 2 (Technical adoption): data lake acquisition; isolated acceptance
• Phase 3 (Business adoption): data quality and governance; growing importance
• Phase 4 (Data & Analytics as a Service): automated data management & administration; embraced throughout business disciplines; data-driven organization & products
Phase components shown across the phases:
• Visualization of deliveries: real-time dashboard(s) → algorithm-embedded dashboard(s) → algorithm performance dashboard(s)
• Possible type of ML used in each phase: pattern detection → unsupervised learning → supervised learning → recommender system(s) → deep learning
• Platform maturity (data + technology): data exploration → experimental design → map data sources vs. customer touch points → acquire a solution for the architecture → control data quality → merge data sources and automate processing → design experiments – extract preference data → pipeline data processing & application flow
43. The same maturity matrix as on the previous slide, now overlaid with where the two delivery modes fit: One-off (Proof of Concept = POC) and Production Pipeline.
45. Roles involved: business knowledge, data scientist, data engineer, IT support
One-off iterations:
business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deliverables
Production pipelines:
business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deployment → apply to application
plus: performance optimization, enable automation
47. The same two flows (one-off iterations vs. production pipelines), with the roles business knowledge, data scientist, data engineer and IT support, annotated with rough shares of effort: 70-80% vs. 10%~20%
48. Comparison of the two approaches
One-off:
• Organization maturity: phase 1 – phase 2
• What they are looking for: to understand how data science works (baby steps)
• Project scope: small, 4-8 weeks
• Platform & technology: do not change anything existing in-house
• Data source availability: mainly the DB + 1 or 2 additional data sources
• Data quality: poor, needs lots of cleaning
• Deliverables: focus on interpretation (visualized)
Production pipeline:
• Organization maturity: phase 2 and onward
• What they are looking for: participate in the data science process
• Project scope: at least 3 months and above
• Platform & technology: considering, or already migrating to, a new platform/technology
• Data source availability: start to map out all available data sources
• Data quality: start to sort out data quality
• Deliverables: focus on code (hence limitations on programming language)
49. Data Science Process – boxed-in activities overview
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
50. Activities per stage
Define the business question: define the goal; decompose the question; verify understanding; project scoping; map data sources; establish a performance measure; data scientist workspace; task force; business limitations; define the project scope
Data acquisition & preparation: environment set-up; languages: SQL, R, Python, etc.; data sources merging; data pre-scan Q&A; data quality review
Descriptive statistics (data exploration): explore data (plots); data manipulation; outliers/NAs; summary statistics; data exploration review
Feature engineering: establish a performance threshold; feature engineering; algorithm selection; business sign-off
Model building & validation: types of models; model selection criteria; build and validate the model; review results
Deploy / deliverables: to whom; on what platform; update frequency; performance review; infographic (visualization); deployment review
51. Step-wise Data Science Process: starting from business question scoping
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
52. Question scoping loop
Define the scope: 1. thresholds; 2. data scope; 3. resources; 4. task force; 5. limitations; 6. budget & timeline, etc.
Specify the questions and check whether the scope is ready or not; iterate until it is done.
Then move on to the data: how to get the data (access), the data lake, environment set-up issues, extraction.
Next: about data
53. Step-wise Data Science Process: data acquisition → data preparation
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
54. Acquire data – merge the data sources (the GOAL: a single joined view)
Main table (PK = Transaction ID, FK = StoreID), joined with:
1. Customer purchase information (database; PK = CID, FK = Transaction ID) – joined by Transaction ID
2. Website browsing: pages viewed, avg. time on site, products browsed, etc. (clickstream; PK = CookieID, FK = Transaction ID) – joined by Transaction ID
3. Promotions: campaign name, campaign duration, in which store, discount level, etc. (campaign tool; PK = CampaignID, FK = StoreID) – joined by StoreID
4. Store survey: questions, scale of satisfaction, product rating, etc. (survey tool; PK = SurveyID, FK = StoreID) – joined by StoreID
5. Store geo info: location, km to center, km to customer's address, km to competitor's store in the same postcode region, etc. (API calls; PK = StoreID) – joined by StoreID
6. Customer interests (social; PK = email address) – joined by email
Customer database (database; PK = CID, FK = email) – joined by CID
(A small merge sketch in code follows below.)
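To make the join order concrete, here is a minimal pandas sketch of the merges described on the slide. The DataFrame and column names are hypothetical stand-ins; only the keys (TransactionID, StoreID, CID, email) come from the slide.

```python
import pandas as pd

# Hypothetical frames standing in for the sources named on the slide;
# column names follow the PK/FK hints (TransactionID, StoreID, CID, email).
transactions = pd.DataFrame(columns=["TransactionID", "StoreID", "amount"])
purchases    = pd.DataFrame(columns=["CID", "TransactionID"])
clickstream  = pd.DataFrame(columns=["CookieID", "TransactionID", "pages_viewed"])
promotions   = pd.DataFrame(columns=["CampaignID", "StoreID", "discount_level"])
survey       = pd.DataFrame(columns=["SurveyID", "StoreID", "satisfaction"])
store_geo    = pd.DataFrame(columns=["StoreID", "km_to_center"])
customers    = pd.DataFrame(columns=["CID", "email"])
interests    = pd.DataFrame(columns=["email", "interest_topic"])

# Follow the join order sketched on the slide: transaction-level joins first,
# then store-level joins, then customer / email-level joins.
df = (transactions
      .merge(purchases,   on="TransactionID", how="left")  # 1. by Transaction ID
      .merge(clickstream, on="TransactionID", how="left")  # 2. by Transaction ID
      .merge(promotions,  on="StoreID",       how="left")  # 3. by StoreID
      .merge(survey,      on="StoreID",       how="left")  # 4. by StoreID
      .merge(store_geo,   on="StoreID",       how="left")  # 5. by StoreID
      .merge(customers,   on="CID",           how="left")  # joined by CID
      .merge(interests,   on="email",         how="left")) # 6. by email
```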
55. Step-wise Data Science Process: descriptive statistics
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
56. A flower called iris – three species: Setosa, Virginica, Versicolor
Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
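As a quick illustration of the descriptive-statistics step on this example dataset, here is a minimal sketch (my own, not from the slides) that loads iris and prints summary statistics overall and per species.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the classic iris data set referenced on the slide.
iris = load_iris(as_frame=True)
df = iris.frame  # four measurements per flower plus the species label ("target")

# Basic descriptive statistics: counts, mean, std, quartiles per column.
print(df.describe())

# The same measurements averaged per species, to spot separability between classes.
print(df.groupby("target").mean())
```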
62. Step-wise Data Science Process: feature engineering
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
63. Feature selection (things to consider)
- Observations from the descriptive statistics
- Remove highly correlated columns/parameters (example slides further down the presentation; see the sketch after this list)
- Candidate models' requirements?
- Some models require One-Hot-Encoding (e.g. neural networks, PCA, K-means clustering)
- Outlier sensitive or not? (e.g. regression models are more sensitive to outliers than tree models)
- Forward stepwise / backward stepwise / shrinkage selection concepts vs. black-box models that rank feature importance?
- Computing time vs. response
- Business limitations (e.g. the business requires shrinking the features to <= 20)
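One of the bullets above, dropping highly correlated columns, can be sketched in a few lines of pandas. This is a minimal illustration under my own assumptions (the 0.9 threshold is arbitrary), not a full feature-selection procedure.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column of every pair whose absolute Pearson correlation
    exceeds the threshold. A simple sketch, not a complete selection step."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```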
64. Example (justifying selected features)
Background:
You've done an exploratory analysis of correlations; you have the result, and now you need to explain it in a way a five-year-old can understand, and use the exploratory results to do your feature selection!
66. Explaining correlation with a metaphor, continued
Observation interval of the distance from A to B, direction to the right:
Highly correlated (0.75 ~ 1): the Tesla and the Volvo move at almost the same speed and in the same direction
Positively correlated (0.5 ~ 0.75): the Tesla moves a bit faster than the Volvo, but they are still both heading in the same direction
Negatively correlated (< 0): the Tesla and the Volvo move in different directions
67. Linear correlation
In the following slides, for intuitive convenience, we rescale and map the correlation coefficient into a % format, e.g. a strong positive correlation of 1 becomes 100%.
Pearson's correlation:
r(X, Y) = cov(X, Y) / (σ_X · σ_Y)
where cov(X, Y) is the covariance of the variables X and Y, and σ_X and σ_Y are the standard deviations of X and Y.
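A minimal sketch of that formula in Python, with toy speed numbers for the two cars in the metaphor (the numbers are invented for illustration only):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation: cov(X, Y) / (std(X) * std(Y))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

speed_tesla = [60, 62, 61, 63, 64]   # toy numbers, just for illustration
speed_volvo = [59, 61, 60, 64, 65]

r = pearson_r(speed_tesla, speed_volvo)
print(f"r = {r:.2f}  (~{r:.0%} in the slide's percentage view)")
```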
68. The result of the analysis
[Correlation matrix of the sensor variables: external sheet temperature of the exhaust pipe, actual exhaust temperature of the exhaust pipe, process value of the regulator under pressure, process value of the regulator hood damper, negative pressure of the exhaust pipe, regulator value of the hood damper, regulator value of the exhaust damper, actual value of the exhaust-pipe damper, and the regulator process value]
69. Before we leave this metaphor – one last thing:
"Correlation does not imply causation!"
70. Correlation does not imply causation!
Question: why did these two cars (the Tesla and the Volvo) move in the same direction in the first place?
Guess 1: husband and wife ("I drive the Tesla", "I drive the Volvo")
Guess 2: a racing track from A to B
Guess 3: coincidence
71. Before diving into training your model(s) … ask yourself: what type of model should I use?
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deployment]
72. What types of models are suitable?
Question: do you have the correct answers (labels) for the given business question?
YES → supervised learning (regressions, classes)
NO → unsupervised learning / deep learning (clustering, association analysis)
73. Before diving into training your model(s) … the models landscape: 1. supervised, 2. unsupervised, 3. deep learning
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deployment]
74. Supervised Learning
Regressions: linear regression; stepwise regression; piecewise polynomials and splines; smoothing splines; logistic regression; multivariate adaptive regression splines; least absolute shrinkage and selection operator (LASSO); ridge regression; linear discriminant analysis (LDA)
Trees: decision trees; gradient boosted regression trees; adaptive boosting trees (AdaBoost); conditional inference trees (CI trees); bootstrap aggregation (bagging) trees; gradient boosted machines (GBM); random forest (RF)
Support Vector Machines (SVM): support vector classifier (two class); support vector classifier (multiclass); kernels and support vector machines
Unsupervised Learning
Dimensionality reduction: principal component analysis (PCA); singular value decomposition (SVD); MinHash; locality sensitive hashing (LSH); t-distributed stochastic neighbor embedding (t-SNE)
Clustering: K-means clustering; hierarchical clustering; Bradley-Fayyad-Reina (BFR) clustering; Clustering Using REpresentatives (CURE) clustering; Bayesian networks; topic modelling
Market basket: Apriori (association rules); Park, Chen and Yu algorithm (PCY); Savasere, Omiecinski and Navathe (SON); Toivonen's algorithm
Stream analysis: Bloom filters; Flajolet-Martin algorithm; Alon-Matias-Szegedy; Datar-Gionis-Indyk-Motwani algorithm
Neural network families / Deep Learning: perceptrons; simple neural networks (fully connected); deep Boltzmann machines; convolutional neural networks; recurrent neural networks; hierarchical temporal memory
Recommender systems: content-based recommender; user-user recommender; item-item recommender; hybrid recommender; latent Dirichlet allocation recommender
Others: genetic algorithm (chromosome); multi-arm bandit; k-nearest neighbors (KNN)
75. Data Science Process: model training → model validation (example: supervised learning)
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deployment]
76. Training and validation flow
Split the pre-processed data into a training set, a validation set and a test set.
Train the ML models on the training set and check them on the validation set; from the models that pass the test set, select one winning model.
Monitor the winning model's performance and decide whether to re-train the models (yes/no).
If we want to be REALLY picky: live-test the winning model on samples drawn from live data streams.
(A code sketch of this flow follows below.)
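The following is a minimal Python/scikit-learn sketch of that flow under my own assumptions (a 60/20/20 split, two arbitrary candidate models); the slides do not prescribe these choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split into train / validation / test (60 / 20 / 20 here, an assumption).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Train on the training set and compare on the validation set...
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)

# ...then report the winning model once on the held-out test set.
winner = max(val_scores, key=val_scores.get)
print(winner, candidates[winner].score(X_test, y_test))
```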
77. Data Science Process: model selection criteria
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
78. Example (justifying how you select the model)
Background:
You built a prediction model (say, to classify customer purchase = Yes/No); now you need to explain why you picked THAT algorithm in the first place!
79. Construct the criteria for model selection – input from the business as well as from the data characteristics (note: none of the numerical data is normally distributed). A small weighted-scoring sketch follows the table.

Criteria                        | logistic | trees  | RF     | GBM    | weight
Performance = accuracy          | 86.5%    | 86.7%  | 86.8%  | 85.8%  | 10%
Sensitivity                     | 4.6%     | 12.5%  | 8.4%   | 21.4%  | 20%
Interpretability                | 1        | 0.8    | 0.4    | 0.2    | 30%
Time to compute                 | 1        | 0.8    | 0.2    | 0.2    | 20%
# of parameters                 | 2.4      | 2.4    | 1.89   | 2.38   | 10%
Conflict with using regression  | Yes      | partial| minimum| minimum| 10%
Ranking                         | 1.016    | 1.063  | 0.625  | 0.894  | 100%

Performance = (true positives + true negatives) / test-set population: how often the model correctly predicts whether you are a purchaser or a non-purchaser.
Sensitivity = true positives / all positives in the test set: how often the model correctly predicts that you are going to purchase.
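A minimal sketch of the weighted scoring idea behind the Ranking row, showing only two of the models. The numeric mapping of the qualitative "conflict" row is my own assumption for illustration; the slide does not state how its Ranking values were computed.

```python
# Weighted model-selection score: each criterion gets a weight, each model a value.
# Mapping "conflict to use regression" to 0/1 is an assumption, not from the slide.
weights = {"accuracy": 0.10, "sensitivity": 0.20, "interpretability": 0.30,
           "time_to_compute": 0.20, "n_parameters": 0.10, "conflict": 0.10}

models = {
    "logistic": {"accuracy": 0.865, "sensitivity": 0.046, "interpretability": 1.0,
                 "time_to_compute": 1.0, "n_parameters": 2.4, "conflict": 0.0},
    "gbm":      {"accuracy": 0.858, "sensitivity": 0.214, "interpretability": 0.2,
                 "time_to_compute": 0.2, "n_parameters": 2.38, "conflict": 1.0},
}

# Weighted sum per model: higher means a better overall fit to the criteria.
scores = {name: sum(weights[c] * vals[c] for c in weights) for name, vals in models.items()}
print(scores)
```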
80. Data Science Process: explain your model
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
81. Example (explaining the selected model)
Background:
Now that I have selected a model called a recursive partitioning tree (rpart), the stakeholders ask me to explain how this model works …
82. Recursive Partitioning Tree (rpart) – how does it work?
Explained at two levels:
High level – conceptually
Medium level – a bit more detail
84. High level – how does rpart work?
Starting from the parent node, for every parameter Pi the tree checks, for each candidate value Xi:
Criterion 1: does splitting on Pi at value Xi give me more information?
Criterion 2: does splitting on Pi at value Xi give me better prediction accuracy?
Both criteria are used to decide whether to split or not; if the split is made, each child node (child node 2.1, child node 2.2) repeats the same procedure.
Note: information is defined by information theory, with the options of the Gini index and information gain (link).
Hyper-parameters:
• minsplit – the minimum number of observations that must exist in a node for a split to be attempted
• minbucket – the minimum number of observations in a terminal node (= minsplit / 3)
• cp – complexity parameter; penalizes the model if too many parameters are used without much increase in accuracy/information
(A scikit-learn analogue of these knobs follows below.)
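As a rough illustration of those hyper-parameters, here is a minimal Python sketch using scikit-learn's decision tree. This is an analogy, not R's rpart itself: criterion plays the role of the Gini/information choice, min_samples_split of minsplit, min_samples_leaf of minbucket, and ccp_alpha of cp; the concrete values are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Rough scikit-learn analogues of the rpart knobs described above.
tree = DecisionTreeClassifier(
    criterion="entropy",      # use information gain; "gini" is the other option
    min_samples_split=20,     # ~minsplit: a node needs at least 20 observations to split
    min_samples_leaf=7,       # ~minbucket: roughly minsplit / 3, as on the slide
    ccp_alpha=0.01,           # ~cp: prune splits that add complexity without enough gain
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```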
85. Medium Level – a bit more detail
1) information gain 2) accuracy improvement
86. 1) Information gain: checking the impurity of the end nodes, calculated by entropy
Scenario 1: if the end nodes give a 50-50 percent chance of a class being Purchaser or noPurchaser, the split is as good as a guess; the node is said to reach maximum impurity (entropy = 1).
Calculation: -P1(Purchase)·log(P1(Purchase)) - P1(noPurchase)·log(P1(noPurchase)) - P2(Purchase)·log(P2(Purchase)) - P2(noPurchase)·log(P2(noPurchase))
= 0 - (1/2)·log2(1/2) - (1/2)·log2(1/2) + 0 = 1 → maximum impurity
[Diagram, scenario 1: a parent node with 10 data points (5 Purchase + 5 noPurchase) split by condition 1 into end node 1 (0 Purchase + 5 noPurchase) and end node 2 (5 Purchase + 0 noPurchase); P1(Purchase) = 0, P1(noPurchase) = 5/10 = 1/2, P2(Purchase) = 5/10, P2(noPurchase) = 0/10]
Scenario 2: if the end nodes give a 100 percent chance of a class being Purchaser or noPurchaser, the classification is perfect; the node is said to reach minimum impurity (entropy = 0).
Calculation: -P1(Purchase)·log(P1(Purchase)) - P1(noPurchase)·log(P1(noPurchase)) - P2(Purchase)·log(P2(Purchase)) - P2(noPurchase)·log(P2(noPurchase))
= 0 - (1)·log2(1) + 0 = 0 → minimum impurity
[Diagram, scenario 2: a parent node with 10 data points (0 Purchase + 10 noPurchase) split by condition 1 into end node 1 (0 Purchase + 10 noPurchase) and end node 2 (0 Purchase + 0 noPurchase); P1(Purchase) = 0, P1(noPurchase) = 10/10 = 1, P2(Purchase) = 0, P2(noPurchase) = 0]
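A tiny code check of the entropy values quoted above, using the standard per-node entropy over class counts (my own helper, consistent with the numbers on the slide):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a node given class counts, e.g. [purchase, no_purchase]."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip zero counts: 0*log(0) := 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([5, 5]))   # 1.0 -> maximum impurity, the 50-50 "as good as a guess" node
print(entropy([0, 10]))  # 0.0 -> minimum impurity, a perfectly classified node
```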
87. 2) How rpart calculates the misclassification rate for parameter Pi with value Xi
The rpart model asks, for each and every value Xi of a parameter Pi: was it a good idea (via the misclassification rate) to split on this value? It does so for all parameters Pi and all possible values Xi associated with Pi (see the example below).
[Diagram: a tree on 20 data points, with candidate splits on "Age < 45?", "cntTotal < 110?" and "cntTotal < 75?", leading to leaves that predict noPurchase (7 points), Purchase (3 points), noPurchase (5 points) and Purchase (5 points), with correctly classified rates of 1/7, 1/3, 1/5 and 1/5 respectively.]
Overall correctly classified rate = (true Purchase + true noPurchase) / total population = 4/20 = 20%
Misclassification rate = 1 - 20% = 80%
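The arithmetic at the end, written out as a minimal snippet (the counts are read off the slide's example as I understand it):

```python
# Of 20 observations, 4 end up correctly classified across the four leaves
# (one correct in each leaf: 1/7, 1/3, 1/5, 1/5 -> 4 correct in total).
total = 20
correctly_classified = 4

accuracy = correctly_classified / total    # 0.20 -> 20 %
misclassification_rate = 1 - accuracy      # 0.80 -> 80 %
print(accuracy, misclassification_rate)
```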
88. Data Science Process: deployment
[Process flow: business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deploy/deliverables]
89. Deliverables: One-off (POC)
The data scientist delivers processed data for visualization plus model performance metrics & output predictions, and the result has to pass the business owner's vision.
Audience: board members / CTO, CEO, CFO, etc.; marketing directors; marketers.
Focus: interpretability; lessons learned – final reports or prototype dashboards for internal sales; WoW-effect visualization.
90. Deployment: Production Pipeline
The data scientist delivers processed data for visualization, code for embedding into applications, and model performance metrics & output predictions, and the result has to pass the integration tests.
Audience: IT + content creators + marketers.
Focus: reproducibility and process efficiency; add to the organization-wide dashboards & reporting pipeline (automated); embed code directly into applications (content recommender, product-mix vs. customer-segment matching, etc.); use the output of the model predictions for further marketing purposes (such as segmentation, customer profiling, etc.).
92. Refresh our memory from the previous sections
• The relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related synonyms
• The skillset map for becoming a data scientist (unicorn version vs. your own)
o Why a teamwork approach
o Dream teammates
o Data science process: the two approaches (why, comparison, boxed-in activities)
o Data science process breakdown in detail (step-wise)
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
Cloud computing: cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand.
Virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
Now I want you to spend some time reading this slide, so I can drink some water, because I am thirsty .. :P
Okay, so motivation is very personal.. you need to find yours.. Here are mine…
I am extremely attracted to knowledge.. In fact, every time I find something interesting.. I can't just let it pass; I need to stay with it until I know more, or enough to satisfy my hunger for knowledge.. … I don't know about you, but for me.. this need to know more drives me to go further….
Secondly, it sounds a bit cliché to say that the beautiful thing about learning is that no one can take it away from you ….
Okay, so if you really think about it, it is true in that, well, in this world we are all alone; we can try to keep those we care about close, we can try to build the most secure locker in the world…
Eventually, things and people leave us… the only thing you are stuck with is yourself and the knowledge you have… in a way, it is both sad and nice..
So the third picture is quite curious… does anyone know who made this art?
https://en.wikipedia.org/wiki/Waterfall_(M._C._Escher)
So, anyone want to guess why I chose this picture?
Things are not always what they seem at first glance; when you look a bit longer, you will realize something is off… then you will ask yourself why that is..
This is exactly the point: it challenges you to think outside the box. We live in a world with conditions… everything comes with conditions that we are not even consciously aware of… for example, we restrict ourselves to thinking in no more than 3 dimensions.. so we get confused when the number of dimensions grows higher than 3… what if we were allowed to go to the 4th or 5th dimension, what would happen?
Another way to think about it is that we assume gravity exists even in pictures.. Okay, so who says it SHOULD exist at all costs? What if we try the surreal ..
This concept of challenging your fundamental ''bias'' extends to everything you do as a data scientist.. Remember that I said in the data science skillset map that you need to be naive and stupid? Ask questions about why things are the way they are.. asking why it is done like this is actually important; it sometimes reveals a hidden truth.
So, can anyone tell me what the difference is between these two pictures?
We have two cars here, one Tesla and one Volvo.
During the interval of this distance (from point A to point B), we know that these two cars are both moving to the right at almost the same speed.
We know that when we observe these two cars from point A to point B, we can see that they will arrive at approximately the same place, and they move along the path almost simultaneously, synchronized.
Now, this could be because a husband and a wife (each owning a car) were driving home together, or it could be that these two cars were on a racing track.
It could be completely coincidental: two strangers just happened to join this road in the same direction within the observed path from A to B.
Since we have not been given enough information, we have no idea which of these scenarios it is.. The only valid conclusion we can draw from this is that
when we observe the Tesla and the Volvo, we know that these two cars move together, almost synchronized in speed and time (which translates to the distance they cover being quite similar as well).
So if we know that we will eventually see the Tesla when standing at point B, we know that we will also have the Volvo there when we see the Tesla.
Now we only need to know one of the cars (either the Tesla or the Volvo) when we are at point B to determine how much distance these two cars covered (since they arrive at the same point B at almost the same time.. so we can actually just pick one).
This means that these two cars are positively correlated, and their correlation is quite strong, approaching 1, since they move in the same direction almost simultaneously.
Now think about the fact that we do not know whether these two cars happened to move in the same direction simultaneously by accident, or whether there is some scenario behind the scenes yet to be discovered.. which means that correlation (either positive or negative) does not mean causation.
So why is it important for feature engineering to know this?
Okay, so let's say that we want to know fuel-consumption efficiency for cars; we then should NOT take the Tesla into consideration, since the Tesla does not even use fuel.
It would just confuse the model I build; the model could not possibly know why the Tesla has only zeros as values through and through when it comes to fuel consumption.
Hence it is actually harmful not to carefully select your features.