ch2 DS.pptx

derbew2112

Data Science

Chapter 2
Data
Introduction
● Data science is now one of the most influential topics around.
● Companies and enterprises are focusing heavily on recruiting data
science talent, which in turn creates more viable roles in the data
science industry.
● Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from structured, semi-structured, and unstructured data.
● Example: the data involved in buying a box of cereal from a store or
supermarket.
Data Science vs Data scientist
• Data science is defined as the extraction of actionable
knowledge directly from data through a process of
discovery, hypothesis formulation, and hypothesis
analysis.
• It is a process of effectively producing, or helping to
produce, some tool, method, or other product that
derives intelligence from very large datasets.
Data Science vs Data scientist
• A data scientist (a job title) is a person engaged in a
systematic activity to acquire knowledge from data.
• In a more restricted sense, a data scientist may refer to
an individual who uses the scientific method on
existing data.
• Data Scientists perform research toward a more
comprehensive understanding of products, systems, or
nature, including physical, mathematical and social
realms.
Role of a Data Scientist
• Advance skills in analyzing large amounts of data,
data mining, and programming.
• Processed and filtered data are handed to data scientists,
who feed them to various analytics programs and
machine learning with statistical methods to generate
data that will then be used in predictive analysis and
other fields.
• Explore the data for more cryptic patterns in order to
derive proper insights.

Data Science
• The scientific method requires data in order to iterate towards a more convincing hypothesis.
• Science doesn't exist without data.
• Data scientists possess:
  ○ A strong quantitative background in statistics
  ○ Knowledge of linear algebra
  ○ Programming knowledge with a focus on data warehousing, mining, and modeling to build and analyze algorithms
Algorithms
• An algorithm is a set of instructions designed to perform a specific task.
• This can be a simple process, such as multiplying two numbers, or a complex operation, such as playing a compressed video file.
• Search engines use proprietary algorithms to display the most relevant results from their search index for specific queries.
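As a minimal sketch of the simple end of that range, the function below multiplies two integers using only repeated addition. The name `multiply` is an illustrative assumption and does not come from the slides.

```python
def multiply(a, b):
    """Multiply two integers with a fixed sequence of steps: repeated addition."""
    result = 0
    # Add `a` to the running total once for each unit of |b|.
    for _ in range(abs(b)):
        result += a
    # Restore the sign if b was negative.
    return result if b >= 0 else -result


print(multiply(6, 7))
```

The point is only that an algorithm is a finite, unambiguous list of steps; real multiplication in hardware or libraries is done differently.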
Data vs. Information
• Data
  ○ Can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
  ○ Can be described as unprocessed facts and figures.
  ○ Is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
• Information
  ○ The processed data on which decisions and actions are based.
  ○ Information is interpreted data, created from organized, structured, and processed data in a particular context.
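The distinction can be sketched in a few lines: a list of raw figures is data, and a summary computed from it in a given context is information. The sales figures and the `summarize` helper below are hypothetical, chosen only for illustration.

```python
# Raw data: unprocessed figures (hypothetical units sold on five days).
raw_data = [12, 7, 3, 15, 9]


def summarize(values):
    """Turn raw figures into information that a decision can be based on."""
    return {
        "total": sum(values),
        "average": sum(values) / len(values),
        # Days are counted from 1, so add 1 to the list index.
        "best_day": values.index(max(values)) + 1,
    }


information = summarize(raw_data)
print(information)
```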
Data Processing Cycle
• Data processing is the conversion of raw data into meaningful information through a process.
• Data is manipulated to produce results that lead to the resolution of a problem or the improvement of an existing situation.
• The process includes activities such as data entry/input, calculation/processing, output, and storage.
• Input is the task in which verified data is coded or converted into machine-readable form so that it can be processed by a computer. Data entry is done using a keyboard, digitizer, or scanner, or by importing data from an existing source.
Data Processing Cycle
• Processing is the stage in which the data is subjected to various means and methods of manipulation. It is the point at which a computer program is being executed, and it contains the program code and its current activity.
• Output and interpretation is the stage in which processed information is transmitted to the user. Output is presented to users in various formats, such as printed reports, audio, video, or on a monitor.
• Storage is the last stage in the data processing cycle, in which data, instructions, and information are held for future use. The importance of this cycle is that it allows quick access to and retrieval of the processed information, so that it can be passed on to the next stage directly when needed.
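The four stages of the cycle can be sketched as four small functions chained together. All names here (`input_stage`, `processing_stage`, `output_stage`, `storage_stage`, `stored_reports`) are assumptions made for illustration, and the in-memory list merely stands in for real storage such as a file or database.

```python
# Hypothetical in-memory "storage"; a real system would use a file or database.
stored_reports = []


def input_stage(raw_text):
    """Input: convert raw text such as "10, 20, 30" into machine-readable numbers."""
    return [float(item) for item in raw_text.split(",")]


def processing_stage(values):
    """Processing: manipulate the data, here by computing an average."""
    return sum(values) / len(values)


def output_stage(average):
    """Output: present the result to the user in a readable format."""
    return f"Average: {average:.2f}"


def storage_stage(report):
    """Storage: keep the result for future use."""
    stored_reports.append(report)


def run_cycle(raw_text):
    """Run input, processing, output, and storage in order."""
    values = input_stage(raw_text)
    average = processing_stage(values)
    report = output_stage(average)
    storage_stage(report)
    return report


print(run_cycle("10, 20, 30"))
```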
Data types
• A data type is a way to tell the compiler which kind of data (integer, character, float, etc.) is to be stored and, consequently, how much memory to allocate.
• Put differently, a data type tells the compiler that a memory cell x may hold only bit values within some range y. It prevents the compiler from storing anything outside that range.
• Common data types include:
  ○ Integers (int): used to store whole numbers, mathematically known as integers
  ○ Booleans (bool): used to represent values restricted to one of two options: true or false
  ○ Characters (char): used to store a single character
  ○ Floating-point numbers (float): used to store real numbers
  ○ Alphanumeric strings (string): used to store a combination of characters and numbers
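The link between a type and the memory it needs can be illustrated with Python's standard `struct` module, which packs values into C-style fixed-size encodings. This is only a sketch: Python itself is not compiled in the way the slide describes, and the mapping `FORMAT_CODES` and the helper `storage_size` are names assumed for this example.

```python
import struct

# struct format codes; the "<" prefix selects standard, platform-independent sizes.
FORMAT_CODES = {"int": "i", "bool": "?", "char": "c", "float": "f"}


def storage_size(type_name):
    """Return how many bytes the standard encoding of a type occupies."""
    return struct.calcsize("<" + FORMAT_CODES[type_name])


for name in FORMAT_CODES:
    print(name, storage_size(name), "byte(s)")

# Python values carry a type as well, even without declarations.
print(type(42).__name__, type(True).__name__, type(3.14).__name__, type("a1").__name__)
```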
Data representation
• Types are an abstraction that lets us model things in categories; they are largely a mental construct.
• All computers represent data as nothing more than strings of ones and zeroes.
• For those ones and zeroes to convey any meaning, they need to be contextualized.
• Data types provide that context.
• E.g. 01100001 can be read as the integer 97 or as the character 'a', depending on the type.
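A short sketch of how the same eight bits take on different meanings depending on the type used to interpret them. The helper name `interpret` is an assumption made for this example.

```python
def interpret(bits, kind):
    """Interpret a string of ones and zeroes as an integer or as a character."""
    value = int(bits, 2)  # read the bits as a base-2 number
    if kind == "int":
        return value
    if kind == "char":
        return chr(value)  # look up the character with that code point
    raise ValueError("unknown kind: " + kind)


bits = "01100001"
print(interpret(bits, "int"))   # 97
print(interpret(bits, "char"))  # a
```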
Data types from a Data Analytics perspective
• Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the help of specialized systems and software.
• From a data analytics point of view, it is important to understand that there are three common types of data or data structures:
  ○ Structured,
  ○ Semi-structured, and
  ○ Unstructured data
Structured Data
• Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
• Structured data covers all data that can be stored in an SQL database, in tables with rows and columns. It has relational keys and can easily be mapped into pre-designed fields.
• Structured data is highly organized information that uploads neatly into a relational database.
• Structured data is relatively simple to enter, store, query, and analyze, but it must be strictly defined in terms of field name and type.
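A minimal sketch of structured data, using Python's built-in `sqlite3` module and an in-memory database. The `customers` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# An in-memory SQL database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema strictly defines each field's name and type up front.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Alice", "Addis Ababa"), (2, "Bob", "Adama")],
)

# Because every row has the same columns, querying is straightforward.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY id", ("Adama",)
).fetchall()
print(rows)
conn.close()
```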
Unstructured Data
• Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
• Unstructured data may have its own internal structure, but it does not fit neatly into a spreadsheet or database.
• Most business interactions, in fact, are unstructured in nature.
• Today more than 80% of the data generated is unstructured.
• The fundamental challenge of unstructured data sources is that they are difficult for nontechnical business users and data analysts alike to unbox, understand, and prepare for analytic use.
Semi-structured Data
• Semi-structured data is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze.
• Examples of semi-structured data: CSV, XML, and JSON documents. NoSQL databases are also considered semi-structured.
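As a small sketch of what "tags or other markers" look like in practice, the JSON document below holds two records whose fields differ, something a fixed relational table would not allow without schema changes. The records and the `field_names` helper are hypothetical.

```python
import json

# Two records in one JSON document; note that their fields are not identical.
document = """
[
  {"name": "Alice", "phones": ["0911-000000"]},
  {"name": "Bob", "email": "bob@example.com"}
]
"""

records = json.loads(document)


def field_names(items):
    """Collect every key that appears in at least one record."""
    return sorted({key for item in items for key in item})


print(field_names(records))
```

The keys act as the markers that separate semantic elements, while each record remains free to include or omit fields.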
Metadata – Data about Data
• Metadata is data about data: data that describes other data.
• It provides additional information about a specific set of data.
• Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier.
• For example, author, date created, date modified, and file size are very basic document metadata.
• Having the ability to filter through that metadata makes it much easier for someone to locate a specific document.
• In the context of databases, metadata would be information about tables, views, columns, arguments, etc.
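Basic file metadata of the kind listed above can be read with Python's standard `os.stat`. The sketch below creates a temporary file purely for demonstration; the helper name `file_metadata` is an assumption.

```python
import os
import tempfile


def file_metadata(path):
    """Return basic metadata describing a file, not its contents."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": info.st_mtime,  # seconds since the epoch
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    temp_path = handle.name

metadata = file_metadata(temp_path)
os.remove(temp_path)
print(metadata)
```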
Data value chain
• The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• The steps are data acquisition, data analysis, data curation, data storage, and data usage.
• Data acquisition is the process of digitizing data from the world around us so that it can be displayed, analyzed, and stored in a computer. It also covers the processes for bringing data that was created by a source outside the organization into the organization for production use.
Data value chain
• Data analysis is the process of evaluating data using analytical and logical reasoning to examine each component of the data provided. Data from various sources is gathered, reviewed, and then analyzed to form a finding or conclusion.
• Data analytics is the process of finding information in data in order to make a decision and subsequently act on it.
Data value chain
• Data curation is about managing data throughout its lifecycle. It includes collecting, organizing, cleaning, and much more. Data curators manage the data through various stages and make it usable for data analysts and scientists.
• Data storage is defined as a way of keeping information in memory or storage for use by a computer. An example of data storage is a folder for storing Microsoft Word documents.
• Data usage is the amount of data (things like images, movies, photos, videos, and other files) that you send, receive, download, and/or upload.
Big Data Definition
• No single standard definition…
• “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Big data
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• In other words, data in the range of hundreds of terabytes (TB) or petabytes (PB) is considered Big Data.
• However, the amount of data is not the whole story; what matters is what organizations do with the data.
• Big Data is analyzed for insights that lead to better decisions.
Big data
• Big Data is associated with the concept of the 3 Vs: volume, velocity, and variety. Big data is characterized by these 3 Vs and more:
  ○ Volume: large amounts of data (zettabytes, massive datasets)
  ○ Velocity: data is live streaming or in motion
  ○ Variety: data comes in many different forms from diverse sources
  ○ Veracity: can we trust the data? How accurate is it?
Clustered Computing
• Cluster computing addresses the latest results in the fields that support High Performance Distributed Computing (HPDC).
• The clustering methods have been identified as HPC IaaS and HPC PaaS, which are more expensive and more difficult to set up and maintain than a single computer.
• In HPDC environments, parallel and/or distributed computing techniques are applied to the solution of computationally intensive applications across networks of computers.
Clustered Computing
• “Computer cluster” basically refers to a set of connected computers working together.
• The cluster represents one system, and the objective is to improve performance.
• The computers are generally connected in a LAN (Local Area Network).
• So, when this cluster of computers works to perform some tasks and gives the impression of being a single entity, it is called “cluster computing”.
Clustered Computing
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
• Resource Pooling:
  ○ Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important. Processing large datasets requires large amounts of all three of these resources.
  ○ Object pooling is a technique that keeps a group of objects (called the pool storage) in memory.
  ○ Whenever a new object needs to be created, the pool storage is checked first; if an object is available, it is reused. This provides reuse of objects and system resources and improves the scalability of the program.
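The object pooling idea described above can be sketched in a few lines. This is a single-process illustration under stated assumptions, not how any particular clustering product implements pooling; the class name `ObjectPool` and its `acquire` and `release` methods are hypothetical.

```python
class ObjectPool:
    """Keep released objects in memory so they can be reused instead of recreated."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds a new object on demand
        self._available = []     # the "pool storage"

    def acquire(self):
        # Check the pool first; create a new object only if the pool is empty.
        if self._available:
            return self._available.pop()
        return self._factory()

    def release(self, obj):
        # Return the object to the pool for later reuse.
        self._available.append(obj)


pool = ObjectPool(list)
first = pool.acquire()
pool.release(first)
second = pool.acquire()
print(first is second)
```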
Clustered Computing
• High Availability: In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.
• Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as we continue to emphasize the importance of real-time analytics.
• Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources on a machine.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier. It allows for the distributed processing of large datasets across clusters of computers using simple programming models.
• The four key characteristics of Hadoop are:
  ○ Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
  ○ Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
  ○ Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
  ○ Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
Hadoop and its Ecosystem
● Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
● It is continuously growing to meet the needs of Big Data.
● It comprises the following components and many others:
  ○ HDFS: Hadoop Distributed File System
  ○ YARN: Yet Another Resource Negotiator
  ○ MapReduce: Programming-based data processing
  ○ Spark: In-memory data processing
  ○ Pig, Hive: Query-based processing of data services
  ○ HBase: NoSQL database
  ○ Mahout, Spark MLlib: Machine learning algorithm libraries
  ○ Solr, Lucene: Searching and indexing
  ○ ZooKeeper: Managing the cluster
  ○ Oozie: Job scheduling
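To give a feel for the MapReduce programming model listed above, here is a word count written in plain Python. It is a conceptual sketch only: it runs on one machine and does not use Hadoop or its API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


print(word_count(["big data", "Big clusters"]))
```

In a real cluster the map and reduce steps would run in parallel on different machines, with the framework handling the shuffle between them.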
Big Data Life Cycle
End of Chapter 2
Does anyone have any questions?
Thanks