The document provides an overview of database, big data, and data science concepts. It discusses topics such as database management systems (DBMS), data warehousing, OLTP vs OLAP, data mining, and the data science process. Key points include:
- DBMS are used to store and manage data in an organized way for use by multiple users. Data warehousing is used to consolidate data from different sources.
- OLTP systems are for real-time transactional systems, while OLAP systems are used for analysis and reporting of historical data.
- Data mining involves applying algorithms to large datasets to discover patterns and relationships. The data science process involves business understanding, data preparation, modeling, evaluation, and deployment.
A very basic introduction to Big Data: what it is, its characteristics, and some examples of Big Data frameworks, with a Hadoop 2.0 example covering YARN, HDFS, and MapReduce with ZooKeeper.
Application of Data Warehousing & Data Mining to Exploitation for Supporting ... by Gihan Wikramanayake
M G N A S Fernando, G N Wikramanayake (2004) "Application of Data Warehousing and Data Mining to Exploitation for Supporting the Planning of Higher Education System in Sri Lanka", in: 23rd National Information Technology Conference, pp. 114-120, Computer Society of Sri Lanka (CSSL), Colombo, Sri Lanka, Jul 8-9. ISBN: 955-9155-12-1
Big data analytics is the process of examining large data sets containing a variety of data types (i.e., big data) to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations, and other business benefits. Enterprises are increasingly looking for actionable insights in their data. Many big data projects originate from the need to answer specific business questions. With the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service, and risk management. Notably, the business area getting the most attention relates to increasing efficiencies and optimizing operations. By using big data analytics you can extract only the relevant information from terabytes, petabytes, and exabytes, and analyse it to transform your business decisions for the future. Becoming proactive with big data analytics isn't a one-time endeavour; it is more of a culture change, a new way of gaining ground.
Keywords: business, analytics, exabytes, efficiency, data sets
Difference between data warehouse and data mining, by maxonlinetr
What exactly is a Data Warehouse?
A Data Warehouse is a special type of database used for storing large amounts of data, such as analytics, historical, or customer data, which can be leveraged to build large reports and to run data mining against it.
What is Data mining?
Data mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
Data Mining With Excel 2007 And SQL Server 2008, by Mark Tabladillo
Introduction to Excel 2007 Data Mining Plug-In using SQL Server 2008. The presentation starts with definitions and statistical theory (without equations). Then, the audience interactively participates in four demos showing the power and possibilities of the Microsoft Data Mining Algorithms.
Types of database processing: OLTP vs. data warehouses (OLAP). Data warehouse properties: subject-oriented, integrated, time-variant, non-volatile. Functionalities of a data warehouse: roll-up (consolidation), drill-down, slicing, dicing, pivot. The KDD process and applications of data mining.
What Big Data is, why it is needed by organizations that generate huge amounts of data, and when it should be used.
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.[2] Though the term is sometimes used loosely, partly due to a lack of a formal definition, the best interpretation is that it is a large body of information that cannot be comprehended when used in small amounts only.
Data Lake Acceleration vs. Data Virtualization - What’s the difference? by Denodo
Watch full webinar here: https://bit.ly/3hgOSwm
Data lake technologies have been in constant evolution in recent years, with each iteration promising to fix what previous ones failed to accomplish. Several data lake engines are hitting the market with better ingestion, governance, and acceleration capabilities that aim to create the ultimate data repository. But isn't that the promise of a logical architecture with data virtualization too? So, what’s the difference between the two technologies? Are they friends or foes? This session will explore the details.
This seminar is about data warehousing: what data warehousing is, a comparison between a database and a data warehouse, different data warehouse models, data marts, and the disadvantages of data warehousing.
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios, by kcmallu
What's the origin of Big Data? What are the real-life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organization?
AzureDay - Introduction to Big Data Analytics, by Łukasz Grala
AzureDay North 2016, a conference about cloud solutions.
What is analytics? What is Big Data? Why Big Data belongs in the cloud, and what Microsoft offers for Big Data analytics. How do you start with Big Data Analytics or Advanced Analytics? The session introduces the fundamentals of Big Data and Advanced Analytics.
By Data Scientist as a Service
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture, by DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Build the data lake, but avoid building the data swamp! The tool ecosystem is building up around the data lake, and soon many organizations will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
The Data Lake and Getting Businesses the Big Data Insights They Need, by Dunn Solutions Group
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can become a task to keep up with and clearly understand each of them. However a data lake is definitely something to dedicate the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location ready for consumption at any time – this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
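The automated-validation idea above can be sketched in a few lines. This is a minimal, hypothetical illustration: the rule names, column names, and sample rows are invented, and a real pipeline would use a dedicated data-quality framework rather than hand-written lambdas.

```python
# Minimal sketch of rule-based data quality checks run at ingestion time.
# Rules, column names, and rows are invented for illustration.

def validate_row(row, rules):
    """Return the names of every rule the row violates."""
    return [name for name, check in rules.items() if not check(row)]

rules = {
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
    "email_present": lambda r: bool(r.get("email")),
}

rows = [
    {"age": 34, "email": "a@example.com"},   # clean row
    {"age": 250, "email": ""},               # violates both rules
]

# Flag errors at the source, before they cause downstream issues.
report = {i: validate_row(r, rules) for i, r in enumerate(rows)}
print(report)  # → {0: [], 1: ['age_in_range', 'email_present']}
```

Failed rows would typically be quarantined or sent back to the producing system, which is what makes the check useful at the source rather than downstream.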
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be calculated directly; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
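One of the optimizations described above, skipping vertices that have already converged, can be sketched as follows. This is an illustrative toy, not the STICD algorithm: the graph, damping factor, and tolerance are invented, and the per-vertex in-link scan is deliberately naive.

```python
# Power-iteration PageRank that skips rank updates for vertices whose
# rank has already converged. Illustrative sketch only; assumes no
# dangling nodes (every vertex has at least one out-link).

def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=100):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    converged = set()
    for _ in range(max_iter):
        # Contribution each vertex sends along each of its out-links.
        contrib = {v: rank[v] / len(out_links[v]) for v in nodes}
        updated = False
        new_rank = dict(rank)
        for v in nodes:
            if v in converged:           # skip already-converged vertices
                continue
            r = (1 - damping) / n + damping * sum(
                contrib[u] for u in nodes if v in out_links[u])
            if abs(r - rank[v]) < tol:
                converged.add(v)
            else:
                updated = True
            new_rank[v] = r
        rank = new_rank
        if not updated:                  # every vertex has converged
            break
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a 3-cycle: all ranks equal
ranks = pagerank(g)
```

A converged vertex still contributes rank to its neighbours; only the recomputation of its own rank is skipped, which is where the per-iteration saving comes from.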
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23..., by John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ..., by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
1. Prepared by Dr. P L Pradhan, PhD, CSE (System Security), Dept of Information Technology, TGPCET, RTM Nagpur University, Nagpur, India
2. Database, BigData, Data Science
18. • What is the difference between a primary key and a foreign key?
• In a foreign key reference, a link is created between two tables when the column or columns that hold the primary key value for one table are referenced by the column or columns in another table. This column becomes a foreign key in the second table.
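The primary-key/foreign-key link can be demonstrated with Python's built-in sqlite3 module. The table and column names here are invented for illustration; note that SQLite only enforces foreign keys when the pragma is switched on.

```python
# Minimal sqlite3 illustration of a primary key referenced as a foreign key.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only if enabled

con.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE emp (
                   id INTEGER PRIMARY KEY,
                   dept_id INTEGER REFERENCES dept(id))""")  # foreign key

con.execute("INSERT INTO dept VALUES (1, 'IT')")
con.execute("INSERT INTO emp VALUES (10, 1)")    # OK: dept 1 exists

try:
    con.execute("INSERT INTO emp VALUES (11, 99)")   # no dept 99
except sqlite3.IntegrityError as e:
    print("rejected:", e)                # the FK constraint blocks the row
```

The second insert is rejected because 99 does not exist as a primary key value in `dept`, which is exactly the link described above.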
22. Information
• A set of data items satisfying a specific objective.
• Data about data = metadata
23. Database
• A set of data items, logically interconnected, serving several users simultaneously over a LAN or WAN.
• Examples of DBMS: Oracle, Sybase, MS-SQL, Ingres
29. Data Items - Records - Tables
• Tuples make tables
• Tables make databases
• Databases make Big Data => Data Science
• Hadoop helps to extract the desired data and information
33. Operational data
• Operational data is not permanent; it is current data.
• The data is volatile.
• At any time, and all the time, data can be read, written, and executed (RWX): inserted, deleted, and updated.
• Modification and updating of data is very risky.
• Therefore, operational data has no security and privacy.
38. DATA WAREHOUSING
• Separate
• High availability, reliability, and scalability
• Integrated
• Time-stamped (RX)
• Subject-oriented
• Non-volatile (permanent)
• Accessible at all times
41. OLTP-OLAP
• Source of data
• OLTP: Operational data; OLTP systems are the original source of the data.
• OLAP: Consolidated data; OLAP data comes from the various OLTP databases.
• Purpose of data
• OLTP: To control and run fundamental business tasks (raw, current data)
• OLAP: To help with planning, problem solving, and decision support (historical data)
42. OLTP-OLAP
• What the data reveals
• OLTP: A snapshot of ongoing business processes
• OLAP: Multi-dimensional views of various kinds of business activities
• Inserts and updates
• OLTP: Short and fast inserts and updates initiated by end users
• OLAP: Periodic long-running batch jobs refresh the data
43. OLTP-OLAP
• Queries
• OLTP: Relatively standardized and simple queries returning relatively few records
• OLAP: Often complex queries involving aggregations, association, and collaboration
• Processing speed
• OLTP: Typically very fast
• OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes
44. OLTP-OLAP
• Space requirements
• OLTP: Can be relatively small if historical data is archived
• OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP
• Database design
• OLTP: Highly normalized with many tables (3NF)
• OLAP: Typically de-normalized with fewer tables; uses star and/or snowflake schemas
45. Backup and Recovery
• OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
• OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
49. OLTP
• Online transaction processing (OLTP) is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.
• Temporary data; current data
50. OLAP
• OLAP is an acronym for Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.
• Past and present data
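The multidimensional analysis that OLAP performs, such as the roll-up and slice operations listed earlier, can be illustrated with a tiny in-memory "cube". The sales records and dimensions (region, year, product) below are invented, and a real OLAP engine would of course precompute and index these aggregates.

```python
# Toy sketch of OLAP-style roll-up (consolidation) and slicing on a
# small fact set. Data and dimension names are invented.
from collections import defaultdict

sales = [  # (region, year, product, amount)
    ("East", 2023, "pen",    100),
    ("East", 2024, "pen",    150),
    ("West", 2023, "pencil",  80),
    ("West", 2024, "pen",    120),
]

def roll_up(rows, dim):
    """Consolidate amounts up to a single dimension."""
    idx = {"region": 0, "year": 1, "product": 2}[dim]
    totals = defaultdict(int)
    for row in rows:
        totals[row[idx]] += row[3]
    return dict(totals)

def slice_by_year(rows, year):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[1] == year]

print(roll_up(sales, "region"))    # → {'East': 250, 'West': 200}
print(slice_by_year(sales, 2023))  # the 2023 slice of the cube
```

Dicing is the same idea with several dimensions fixed at once, and drill-down is the inverse of roll-up: moving from the consolidated totals back toward the detailed rows.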
57. Big Data
• Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.
• HCI: Human-Computer Interaction on Big Data
• “Much more IT investment is going towards managing and maintaining big data"
58. Big Data
• Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy.
• The term often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.
59. Characteristics
• Big Data represents information assets characterized by such high volume, velocity, and variety as to require specific technology and analytical methods for their transformation into value.
60. Big Data
• Big data is arriving from multiple sources at an alarming velocity, volume, and variety. To extract meaningful value from big data, you need optimal processing power, analytics capabilities, and skills. ... Insights from big data can enable all employees to make better decisions ...
62. BRT
• Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool; rather, it involves many areas of Business, Resource, and Technology.
65. Characteristics
• Volume: big data doesn't sample; it just observes and tracks what happens
• Velocity: big data is often available in real time
• Variety: big data draws from text, images, audio, and video, and it completes missing pieces through data fusion
• Machine learning: big data often doesn't ask why and simply detects patterns
• Digital footprint: big data is often a cost-free by-product of digital interaction
66. Characteristics
• Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
• Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
• Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
• Variability: Inconsistency of the data set can hamper processes to handle and manage it.
• Veracity: The quality of captured data can vary greatly, affecting accurate analysis.
67. 6C
• Factory work and cyber-physical systems may have a 6C system:
• Connection (sensors and networks)
• Cloud (computing and data on demand)
• Cyber (model and memory)
• Content/context (meaning and correlation)
• Community (sharing and collaboration)
• Customization (personalization and value)
68. What Comes Under Big Data?
• Black box data: A component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
• Social media data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
• Stock exchange data: The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on shares of different companies.
• Power grid data: The power grid data holds information consumed by a particular node with respect to a base station.
• Transport data: Transport data includes the model, capacity, distance, and availability of a vehicle.
• Search engine data: Search engines retrieve lots of data from different databases.
70. 3V
• Thus Big Data includes huge volume, high velocity, and an extensible, large variety of data. The data in it will be of three types:
• Structured data: relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.
73. Big Data Challenges
The major challenges associated with big data are as follows:
• Capturing data
• Data curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis, visualization, association, collaboration, communications (OOS, OOP, UML)
• Presentation
75. DSP
• The Data Science Process
• The Data Science Process is a framework for approaching data science tasks, crafted by Joe Blitzstein and Hanspeter Pfister of Harvard's CS 109. The goal of CS 109, as per Blitzstein himself, is to introduce students to the overall process of data science investigation, a goal which should provide some insight into the framework itself.
77. Data Science
• Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
78. DS
• Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, operations research, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, databases, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modelling, data warehousing, data compression, computer programming, artificial intelligence, and high-performance computing. Methods that scale to big data are of particular interest in data science, although the discipline is not generally considered to be restricted to such big data, and big data solutions are often focused on organizing and pre-processing the data instead of analysis. The development of machine learning has enhanced the growth and importance of data science.
79. CRISP-DM
• As a comparison to the Data Science Process put forth by Blitzstein & Pfister, and elaborated upon by Squire, we take a quick look at the de facto official (yet unquestionably falling out of fashion) data mining framework (which has been extended to data science problems), the Cross Industry Standard Process for Data Mining (CRISP-DM). Though the standard is no longer actively maintained, it remains a popular framework for navigating data science projects.
82. Knowledge Discovery in Databases
• KDD Process
• Around the same time that CRISP-DM was emerging, the KDD Process had finished developing. The KDD (Knowledge Discovery in Databases) Process, by Fayyad, Piatetsky-Shapiro, and Smyth, is a framework which has, at its core, "the application of specific data-mining methods for pattern discovery and extraction." The framework consists of the following steps:
Selection
Preprocessing
Transformation
Data Mining
Interpretation
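The five KDD steps above can be sketched end to end on a toy problem. Everything here is invented for illustration: the records, the validity flag used for selection, and the use of simple frequent-pair counting as a stand-in for a real data-mining method.

```python
# Toy walk-through of the KDD steps: selection, preprocessing,
# transformation, data mining, and interpretation. All data invented.
from collections import Counter
from itertools import combinations

raw = [
    {"items": ["bread", "milk"],         "valid": True},
    {"items": ["bread", "milk", "eggs"], "valid": True},
    {"items": None,                      "valid": False},  # bad record
    {"items": ["bread", "milk"],         "valid": True},
]

selected = [r for r in raw if r["valid"]]                 # Selection
cleaned = [r["items"] for r in selected if r["items"]]    # Preprocessing
baskets = [sorted(set(items)) for items in cleaned]       # Transformation
pairs = Counter(p for b in baskets                        # Data Mining:
                for p in combinations(b, 2))              #  frequent pairs
top_pair, count = pairs.most_common(1)[0]                 # Interpretation
print(top_pair, count)  # → ('bread', 'milk') 3
```

Each intermediate variable corresponds to one KDD stage, which is the point of the framework: the mining step proper is only one link in a longer chain.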
84. SAS-SEMMA
• Discussion
• It is important to note that these are not the only
frameworks in this space; SEMMA (for Sample, Explore,
Modify, Model and Assess), from SAS, and the agile-
oriented Guerilla Analyticsboth come to mind. There
are also numerous in-house processes that various
data science teams and individuals no doubt employ
across any number of companies and industries in
which data scientists work.
• So, is the Data Science Process a new take on CRISP-
DM, which is just a reworking of KDD, or is it a new,
independent framework in its own right?
86. Data Science
Exploratory data analysis
Information design
Interactive data visualization
Descriptive statistics
Inferential statistics
Statistical graphics
Plot
Data analysis
Infographic
88. DS
• Data science affects academic and applied research in many domains, including machine translation, speech recognition, robotics, search engines, and the digital economy, but also the biological sciences, medical informatics, health care, social sciences, and the humanities.
• It heavily influences economics, business, and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis.
89. Data Scientist
• Data scientists use their data and analytical ability to find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate the data insights/findings. They are often expected to produce answers in days rather than months, to work by exploratory analysis and rapid iteration, and to produce and present results with dashboards (displays of current values) rather than the papers/reports that statisticians normally produce.
94. Fact Data
• Facts of a business process
• Quality of business: sales, cost, and profit
• In data warehousing, a fact table consists of the measurements, metrics, or facts of a business process. It is located at the center of a star schema or a snowflake schema, surrounded by dimension tables. Where multiple fact tables are used, these are arranged as a fact constellation schema.
• Fact tables are the large tables in a warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit.
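A fact table at the center of a star schema, holding additive measures and foreign keys to a dimension table, can be shown with sqlite3. The table and column names and the figures are invented for illustration.

```python
# Minimal star-schema sketch: a fact table (additive measures plus a
# foreign key) joined to a dimension table and aggregated.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        sales REAL, cost REAL);
    INSERT INTO dim_product VALUES (1, 'pen'), (2, 'pencil');
    INSERT INTO fact_sales VALUES (1, 100.0, 60.0), (1, 50.0, 30.0),
                                  (2, 80.0, 50.0);
""")

# Additive measures (sales, cost, and derived profit) roll up cleanly
# with SUM when grouped by a dimension attribute.
rows = con.execute("""
    SELECT d.name, SUM(f.sales), SUM(f.sales - f.cost) AS profit
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)  # → [('pen', 150.0, 60.0), ('pencil', 80.0, 30.0)]
```

The fact rows are numeric and additive, so any grouping by dimension attributes produces meaningful totals; that additivity is what makes the star layout convenient for analysis.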