This document provides an overview of big data adoption and analytics technologies. It discusses prerequisites for organizations adopting big data, such as data governance frameworks and skillsets. It also outlines the typical big data analytics lifecycle, including stages like data identification, analysis, and utilization of results. Finally, it describes various enterprise technologies that support big data analytics, such as extract-transform-load (ETL) processes, data warehouses, online transaction processing (OLTP), and online analytical processing (OLAP).
Tableau’s predictive modeling feature allows users to leverage powerful statistical models to build and update predictive models efficiently, while giving them the flexibility to select their predictors, use the model results within other table calculations, and comprehend and examine a large volume of data. Go through this presentation to discover how the feature works in practice.
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!! – Caserta
Joe Caserta went over the details inside the big data ecosystem and the Caserta Concepts Data Pyramid, which includes Data Ingestion, Data Lake/Data Science Workbench and the Big Data Warehouse. He then dove into the foundation of dimensional data modeling, which is as important as ever in the top tier of the Data Pyramid. Topics covered:
- The 3 grains of Fact Tables
- Modeling the different types of Slowly Changing Dimensions
- Advanced Modeling techniques like Ragged Hierarchies, Bridge Tables, etc.
- ETL Architecture.
He also talked about ModelStorming, a technique used to quickly convert business requirements into an Event Matrix and Dimensional Data Model.
This was a jam-packed, abbreviated version of four days of rigorous training in these techniques, taught in September by Joe Caserta (co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit) and Lawrence Corr (author of Agile Data Warehouse Design).
For more information, visit http://casertaconcepts.com/.
A complete presentation on data mining and data warehousing, covering database management systems.
Reconciling your Enterprise Data Warehouse to Source Systems – Method360
Implementing an enterprise BI system is a significant organizational investment. Too often the expected benefit of that investment isn’t realized due to inconsistent data between the organization’s operational and BI systems.
This webcast explains several options for reconciling data from source operational systems to BI, enabling your organization to realize the value of its investment.
Difference between data warehouse and data mining – maxonlinetr
What exactly is a Data Warehouse?
Termed a special type of database, a Data Warehouse is used for storing large amounts of data, such as analytics, historical, or customer data, which can be leveraged to build large reports and to support data mining against it. See: http://maxonlinetraining.com/why-is-data-warehousing-online-training-important/
What is Data mining?
The process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions.
For any queries, please contact +1 940 440 8084 / +91 953 383 7156 TODAY to join our Online IT Training course and find out how MaxOnlineTraining.com can help you embark on an exciting and lucrative IT career.
Visit www.maxonlinetraining.com
For a complete course overview and to book: https://goo.gl/QbTVal
This PPT contains definitions of data warehouse, data, and warehouse; data modeling; and data warehouse architecture and its types: single-tier, two-tier, and three-tier.
Data Warehousing is a data architecture that separates reporting and analytics needs from operational transaction systems. This presentation is an introduction to traditional data warehousing architectures and how to determine if your environment requires a data warehouse.
In this Strata+Hadoop World 2015 presentation, Ron Bodkin, President of Think Big, a Teradata company, explains changes for data modeling on big data systems and five important new analytic patterns becoming more commonplace as companies grow their data-driven capabilities.
In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in a single place and are used for creating analytical reports for knowledge workers throughout the enterprise.
DATA SCIENCE AND BIG DATA ANALYTICS CHAPTER 2 DATA ANA.docx – randyburney60861
DATA SCIENCE AND BIG DATA ANALYTICS
CHAPTER 2: DATA ANALYTICS LIFECYCLE
DATA ANALYTICS LIFECYCLE
• Data science projects differ from BI projects
• More exploratory in nature
• Critical to have a project process
• Participants should be thorough and rigorous
• Break large projects into smaller pieces
• Spend time to plan and scope the work
• Documenting adds rigor and credibility
DATA ANALYTICS LIFECYCLE
• Data Analytics Lifecycle Overview
• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize
• Case Study: GINA
2.1 DATA ANALYTICS LIFECYCLE OVERVIEW
• The data analytics lifecycle is designed for Big Data problems and data science projects
• With six phases, the project work can occur in several phases simultaneously
• The cycle is iterative to portray a real project
• Work can return to earlier phases as new information is uncovered
2.1.1 KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT
• Business User – understands the domain area
• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management and extraction, supports analytic sandbox
• Data Scientist – provides analytic techniques and modeling
2.1.2 BACKGROUND AND OVERVIEW OF DATA ANALYTICS LIFECYCLE
• Data Analytics Lifecycle defines the analytics process and best practices from discovery to project completion
• The Lifecycle employs aspects of:
• Scientific method
• Cross Industry Standard Process for Data Mining (CRISP-DM)
• Process model for data mining
• Davenport’s DELTA framework
• Hubbard’s Applied Information Economics (AIE) approach
• MAD Skills: New Analysis Practices for Big Data by Cohen et al.
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
http://www.informationweek.com/software/information-management/analytics-at-work-qanda-with-tom-davenport/d/d-id/1085869?
https://en.wikipedia.org/wiki/Applied_information_economics
https://pafnuty.wordpress.com/2013/03/15/reading-log-mad-skills-new-analysis-practices-for-big-data-cohen/
OVERVIEW OF DATA ANALYTICS LIFECYCLE
2.2 PHASE 1: DISCOVERY
1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
2.3 PHASE 2: DATA PREPARATION
• Includes steps to explore, preprocess, and condition data
• Create robust environment – analytics sandbox
• Data preparation tends to be the most labor-intensive step in the analytics lifecycle.
Drive Smarter Decisions with Big Data Using Complex Event Processing – Perficient, Inc.
This webinar described what CEP is and how it has been deployed in several client organizations to provide more agile, cost-effective and real-time integration across multiple data stores, including:
- Analysis of large amounts of complex, unstructured and semi-structured data
- Harnessing the power of big data, social/mobile data stores and BI projects for real-time decision making
- Predicting events before they happen based on patterns and rules
The Shifting Landscape of Data Integration – DATAVERSITY
Enterprises and organizations from every industry and scale are working to leverage data to achieve their strategic objectives — whether they are to be more profitable, effective, risk-tolerant, prepared, sustainable, and/or adaptable in an ever-changing world. Data has exploded in volume during the last decade as humans and machines alike produce data at an exponential pace. Exciting technologies have also emerged around that data, improving what we can do with it.
Behind this data revolution, there are forces at work, causing enterprises to shift the way they leverage data and accelerate the demand for leverageable data. Organizations (and the climates in which they operate) are becoming more and more complex. They are also becoming increasingly digital and, thus, dependent on how data informs, transforms, and automates their operations and decisions. With increased digitization comes an increased need for both scale and agility at scale.
In this session, we have undertaken an ambitious goal of evaluating the current vendor landscape and assessing which platforms have made, or are in the process of making, the leap to this new generation of Data Management and integration capabilities.
Why Your Data Science Architecture Should Include a Data Virtualization Tool ... – Denodo
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists.
However, most architectures laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative that addresses these issues and can accelerate data acquisition and massaging, plus a customer story on the use of machine learning with data virtualization.
This is a slide deck that was assembled as a result of months of project work at a global multinational. Collaboration with some incredibly smart people resulted in content that I wish I had come across prior to having to assemble this.
UiPath Test Automation using UiPath Test Suite series, part 3 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
- UI automation introduction
- UI automation sample
- Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also often seen developers implement front-end features just by following a framework's standard rules, thinking that this is enough to successfully launch the project, and then the project fails. How do you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk encourages a more independent way of using PHP frameworks, moving towards more flexible and future-proof PHP development.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality – Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working for real.
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Transcript: Selling digital books in 2024: Insights from industry leaders - T... – BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
The Art of the Pitch: WordPress Relationships and Sales – Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... – DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
JMeter webinar - integration with InfluxDB and Grafana – RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Key Trends Shaping the Future of Infrastructure.pdf – Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The keynote covers the key trends across hardware, cloud and open source; it explores how these areas are likely to mature and develop over the short and long term, and considers how organisations can position themselves to adapt and thrive.
4. Organization Prerequisites
• Data management and Big Data governance frameworks.
• Sound processes and personnel with sufficient skillsets.
• Data quality assessment.
• Longevity of the Big Data environment.
• Environment expansion roadmap.
5. Data Procurement
• The nature of the business may make external data very valuable.
• External data sources may include government data sources and commercial data markets.
• The greater the volume and variety of data that can be supplied, the higher the chances are of finding hidden insights from patterns.
6. Data Costs
• Government-provided data, such as geospatial data, may be free.
• Most commercially relevant data will need to be purchased or subscribed to.
• A substantial budget may still be required to obtain external data.
7. Privacy
• Performing analytics on datasets jointly can reveal confidential information about organizations or individuals.
• Addressing privacy concerns requires:
• an understanding of the nature of the data being accumulated
• knowledge of relevant data privacy regulations
• special techniques for data tagging and anonymization (see the sketch below).
• For example, a car’s GPS log or smart meter readings collected over an extended period of time can reveal an individual’s location and behavior.
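To make the tagging and anonymization idea concrete, here is a minimal Python sketch, assuming a simple record layout; the field names, salt, and rounding precision are invented for illustration, not taken from the deck.

```python
import hashlib

# Salt for pseudonymization; in practice, manage and rotate this secretly.
SALT = b"rotate-me-per-dataset"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

# Hypothetical telemetry record (field names are assumptions).
record = {"user_id": "alice@example.com", "lat": 40.71284, "lon": -74.00602}

anonymized = {
    "user_id": pseudonymize(record["user_id"]),  # tagged: pseudonymized
    # Coarsen GPS coordinates so repeated readings no longer pinpoint
    # an individual's exact location and behavior.
    "lat": round(record["lat"], 2),              # tagged: generalized
    "lon": round(record["lon"], 2),              # tagged: generalized
}
print(anonymized)
```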
9. Security
• Securing Big Data involves ensuring that the data networks and repositories are sufficiently secured.
• Proper mechanisms need to be employed (a minimal sketch follows below):
• Authentication
• Authorization
• Access Levels
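As a rough illustration of how these three mechanisms fit together, here is a small Python sketch; the user store, password handling and level numbers are all assumptions (a real system would store salted password hashes, not plaintext, and use a proper identity provider).

```python
# Hypothetical in-memory user store with numeric access levels.
USERS = {"analyst": {"password": "s3cret", "level": 1},
         "admin":   {"password": "t0psecret", "level": 3}}

# Minimum access level required per action (invented for illustration).
REQUIRED_LEVEL = {"read_report": 1, "export_raw_data": 3}

def authenticate(user: str, password: str) -> bool:
    """Authentication: verify the caller is who they claim to be."""
    return USERS.get(user, {}).get("password") == password

def authorize(user: str, action: str) -> bool:
    """Authorization: compare the user's access level to the action's."""
    return USERS[user]["level"] >= REQUIRED_LEVEL[action]

if authenticate("analyst", "s3cret"):
    print("read_report:", authorize("analyst", "read_report"))          # True
    print("export_raw_data:", authorize("analyst", "export_raw_data"))  # False
```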
10. Provenance
• Refers to information about the source of the data and how it has been processed.
• Helps determine the authenticity and quality of data.
• Can be used for auditing purposes.
11. Data States
• At different stages in the analytics lifecycle, data will be in different states:
• data-in-motion
• data-in-use
• data-at-rest
• Whenever Big Data changes state, it should trigger the capture of provenance information, recorded as metadata (one possible approach is sketched below).
• At the beginning, the data’s provenance record can be initialized with the recording of information that captures its pedigree.
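One possible way to capture provenance metadata on each state change is sketched below in Python; the states and fields mirror the slide, while everything else (pedigree text, event details) is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    pedigree: str                      # where the data originally came from
    events: list = field(default_factory=list)

    def record_state_change(self, new_state: str, detail: str) -> None:
        """Append a metadata entry every time the data changes state."""
        self.events.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "state": new_state,        # data-in-motion / data-in-use / data-at-rest
            "detail": detail,
        })

# Initialize the record with the data's pedigree, then log transitions.
prov = ProvenanceRecord(pedigree="vendor feed v2 (assumed example)")
prov.record_state_change("data-in-motion", "ingested via sftp")
prov.record_state_change("data-at-rest", "persisted to raw zone")
prov.record_state_change("data-in-use", "joined with CRM extract")
print(prov.events)
```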
12. Trusted Result
• Provenance information is essential to realizing the value of an analytic result.
• It answers what the origin of the data is and what steps or algorithms were used to process the data that led to the result.
14. Clouds
• Provide remote environments that can host IT infrastructure.
• A Big Data environment may necessitate that some or all of that environment be hosted within a cloud.
• Example:
• An enterprise that runs its CRM system in a cloud decides to add a Big Data solution in order to run analytics on its CRM data.
• This data can then be shared with its primary Big Data environment that resides within the enterprise boundaries.
15. Cloud Justifications
• Inadequate in-house hardware resources.
• Upfront capital investment for system procurement is not available.
• Need for a sandbox environment so that existing business processes are not impacted.
• The Big Data initiative is a proof of concept.
• Datasets that need to be processed are already cloud resident.
• The limits of available computing and storage resources used by an in-house solution are being reached.
18. Business Case Evaluation
• Presents a clear understanding of the justification, motivation and goals of carrying out the analysis.
• Helps decision-makers understand the business resources that will need to be utilized and which business challenges the analysis will tackle.
19. Data Identification
• Identifying the datasets required for the analysis project and their sources.
• A wider variety of data sources may increase the probability of finding hidden patterns and correlations.
• Data sources can be internal and/or external to the enterprise, depending on the business scope of the analysis project and the nature of the business problems being addressed.
20. Data Acquisition and Filtering
• Both internal and external data needs to be persisted once it is generated or enters the enterprise boundary.
• Metadata can be added via automation to data from both internal and external sources to improve classification and querying (see the sketch below).
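A minimal Python sketch of adding metadata via automation at acquisition time; the particular metadata fields chosen here (source, internal flag, timestamp, checksum) are plausible choices, not prescribed by the deck.

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest(payload: bytes, source: str, internal: bool) -> dict:
    """Wrap an acquired payload with automatically generated metadata."""
    return {
        "data": payload.decode(),
        "metadata": {
            "source": source,                    # aids later classification
            "internal": internal,                # internal vs external source
            "acquired_at": datetime.now(timezone.utc).isoformat(),
            "checksum": hashlib.md5(payload).hexdigest(),  # integrity check
        },
    }

# Hypothetical external record entering the enterprise boundary.
record = ingest(b'{"order": 42}', source="partner-api", internal=False)
print(json.dumps(record, indent=2))
```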
21. Data Extraction
• Extracting disparate data and transforming it into a format that the underlying Big Data solution can use.
• Example: comments and user IDs are extracted from an XML document.
• Example: the user ID and coordinates of a user are extracted from a single JSON field.
Both examples are sketched in code below.
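The two extraction examples can be sketched in a few lines of Python; the XML layout, the way the JSON field is packed, and all identifiers below are invented stand-ins for the figures the slide references.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML payload: comments with user IDs (structure assumed).
xml_doc = """
<comments>
  <comment userId="u42">Great product!</comment>
  <comment userId="u7">Shipping was slow.</comment>
</comments>
"""

# Extract (userId, text) pairs from the XML document.
root = ET.fromstring(xml_doc)
comments = [(c.get("userId"), c.text) for c in root.iter("comment")]
print(comments)  # [('u42', 'Great product!'), ('u7', 'Shipping was slow.')]

# Hypothetical JSON record where one field packs user ID and coordinates.
record = '{"location": "u42|40.7128,-74.0060"}'
field = json.loads(record)["location"]

# Split the single field into user ID, latitude and longitude.
user_id, coords = field.split("|")
lat, lon = (float(x) for x in coords.split(","))
print(user_id, lat, lon)
```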
22. Data Validation and Cleansing
• Establishing often complex validation rules and removing any known invalid data.
• Data validation can be used to examine interconnected datasets in order to fill in missing valid data (see the sketch below).
• The presence of invalid data can result in spikes; although such data appears abnormal, it may be indicative of a new pattern.
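A small pandas sketch of both ideas, assuming hypothetical sensor data: a validation rule removes a known invalid code, and an interconnected reference dataset fills in a missing value. All names and values are invented.

```python
import pandas as pd

# Hypothetical sensor readings; 999.0 is a known invalid sentinel code.
readings = pd.DataFrame({
    "sensor_id": [1, 2, 3, 4],
    "temp_c":    [21.5, None, 999.0, 23.1],
})
# Interconnected reference dataset used to fill gaps.
sensors = pd.DataFrame({
    "sensor_id": [1, 2, 3, 4],
    "default_temp_c": [21.0, 22.0, 22.5, 23.0],
})

# Validation rule: keep plausible temperatures (or missing values);
# the known invalid code falls outside the range and is removed.
valid = readings[readings["temp_c"].between(-40, 60) | readings["temp_c"].isna()]

# Fill missing values from the interconnected dataset.
merged = valid.merge(sensors, on="sensor_id")
merged["temp_c"] = merged["temp_c"].fillna(merged["default_temp_c"])
print(merged[["sensor_id", "temp_c"]])
```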
23. Data Aggregation and Representation
• Integrating multiple datasets to arrive at a unified view.
• This stage can become complicated because of differences in data structure and semantics.
• It is important to understand that the same data can be stored in many different forms.
• Example: two datasets are aggregated together using the Id field (sketched below).
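A minimal pandas sketch of the aggregation example, including the "surname" vs "last name" semantics mismatch mentioned in the editor's notes; the datasets and column names are invented.

```python
import pandas as pd

# Two hypothetical datasets that label the same value differently.
crm = pd.DataFrame({"id": [1, 2], "surname": ["Ng", "Díaz"], "spend": [120, 80]})
erp = pd.DataFrame({"id": [1, 2], "last_name": ["Ng", "Díaz"], "orders": [3, 2]})

# Reconcile semantics first: map both columns onto one canonical name.
erp = erp.rename(columns={"last_name": "surname"})

# Aggregate the datasets into a unified view using the id field.
unified = crm.merge(erp, on=["id", "surname"])
print(unified)
```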
24. Data Analysis
• Usually iterative in nature, continuing until the appropriate pattern or correlation is uncovered.
• Can be as challenging as combining data mining and complex statistical analysis techniques to discover patterns and anomalies.
• Can be used to generate a statistical or mathematical model to depict relationships between variables.
26. Confirmatory Data Analysis
• A deductive approach in which the cause of the phenomenon being investigated is proposed beforehand; the proposed cause is called a hypothesis.
• The data is then analyzed to prove or disprove the hypothesis and provide definitive answers to specific questions (see the sketch below).
• Data sampling techniques are typically used.
• Unexpected findings or anomalies are usually ignored, since a predetermined cause was assumed.
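As a concrete instance of the deductive pattern, the sketch below tests an up-front hypothesis with a two-sample t-test; the hypothesis, the simulated data and the significance threshold are all invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothesis (proposed beforehand): the new checkout page increases
# average order value. Sampled order values are simulated here.
control = rng.normal(50, 10, 200)   # old page
variant = rng.normal(53, 10, 200)   # new page

# Analyze the data to support or reject the hypothesis.
t_stat, p_value = stats.ttest_ind(variant, control)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```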
27. Exploratory Data Analysis
• An inductive approach that is closely associated with data mining.
• No hypothesis or predetermined assumptions are generated.
• The data is explored through analysis to develop an understanding of the cause of the phenomenon.
• May not provide definitive answers.
• Used to find a general direction that can facilitate the discovery of patterns or anomalies.
28. Data Visualization
• Data visualization techniques and tools are used to graphically communicate the analysis results for effective interpretation by business users.
29. Data Visualization
• Provide users with the ability to perform visual analysis (a minimal sketch follows below).
• Allow the discovery of answers to questions that users have not yet even formulated.
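A minimal matplotlib sketch of communicating results graphically; the monthly figures are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 158, 190]

# A simple line chart communicates the trend at a glance, letting
# business users spot patterns without reading raw tables.
fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly sales trend")
plt.show()
```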
30. Utilization of Analysis Results
• Determining how and where processed analysis data can be further leveraged.
• Common areas explored during this stage include the following:
1. Input for Enterprise Systems
2. Business Process Optimization
3. Alerts
31. Input for Enterprise Systems
• The results may be fed directly into enterprise systems to enhance and optimize their behaviors and performance.
• New models may be used to improve the programming logic within existing enterprise systems or may form the basis of new systems.
32. Business Process Optimization
• The identified patterns, correlations and anomalies discovered during the data analysis are used to refine business processes.
• Models may also lead to opportunities to improve business process logic.
33. Alerts
• Data analysis results can be used as input for existing alerts or may form the basis of new alerts.
35. Overview
• An enterprise executes as a layered system.
• The strategic layer constrains the tactical layer, which directs the operational layer.
• The alignment of layers is captured through metrics and performance indicators.
• This allows the operational layer to gain insight into how its processes are executing.
36. Transformation of Data
• Process measurements are aggregated and enhanced with additional meaning to become key performance indicators (KPIs), as sketched below.
• KPIs are used by the tactical layer to assess corporate performance, or business execution.
• KPIs are used in conjunction with other measurements and understandings to assess critical success factors (CSFs).
• This forms a series of data enrichment: the transformation of data into information, information into knowledge and knowledge into wisdom.
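A small pandas sketch of rolling operational measurements up into KPIs for the tactical layer; the process names, metrics and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical operational process measurements.
events = pd.DataFrame({
    "process":    ["checkout", "checkout", "support", "support"],
    "duration_s": [12.0, 18.0, 300.0, 240.0],
    "succeeded":  [True, False, True, True],
})

# Aggregate raw measurements into KPIs the tactical layer can consume:
# average cycle time and success rate per business process.
kpis = events.groupby("process").agg(
    avg_duration_s=("duration_s", "mean"),
    success_rate=("succeeded", "mean"),
)
print(kpis)
```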
37. Online Transaction Processing (OLTP)
• A software system that processes transaction-oriented data.
• Completes an activity in real time; work is not batch-processed.
• Stores operational data that is normalized.
38. OLTP Data
• A common source of structured data that serves as input to many analytic processes.
• Big Data analysis results can be used to augment OLTP data stored in the underlying relational databases.
• The queries supported by OLTP systems comprise simple insert, delete and update operations with sub-second response times (see the sketch below).
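The kind of short insert/update operations the slide describes can be sketched with Python's built-in sqlite3 module; the schema and the transfer scenario are invented for illustration.

```python
import sqlite3

# A minimal sketch of OLTP-style operations: short, simple statements
# executed transactionally with sub-second latency.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")

with con:  # each block commits as one transaction
    con.execute("INSERT INTO accounts (id, balance) VALUES (?, ?)", (1, 100.0))
    con.execute("INSERT INTO accounts (id, balance) VALUES (?, ?)", (2, 50.0))

with con:  # transfer: the two updates succeed or fail together
    con.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    con.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

print(con.execute("SELECT id, balance FROM accounts").fetchall())
```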
39. Online Analytical Processing (OLAP)
• A system that is used for processing data analysis queries.
• Forms an integral part of business intelligence, data mining and machine learning processes.
• Can serve as both a data source and a data sink that is capable of receiving data.
• Used in diagnostic, predictive and prescriptive analytics.
40. OLAP
• Performs long-running, complex queries against a multidimensional database whose structure is optimized for advanced analytics.
• Stores historical data that is aggregated and denormalized to support fast reporting.
• Uses databases that store historical data in multidimensional structures.
41. OLAP
• Provides answers to complex queries based on the relationships between multiple aspects of the data (a small sketch follows below).
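A pandas pivot table can stand in for an OLAP cube slice to show the idea of querying relationships between multiple aspects of the data; the fact table below is invented, and a real OLAP system would run such queries against a multidimensional database instead.

```python
import pandas as pd

# Hypothetical denormalized sales facts with three dimensions
# (region, product, quarter).
facts = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US"],
    "product": ["A",  "B",  "A",  "A",  "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "revenue": [100,  150,  200,  220,  90],
})

# Aggregate revenue across the region dimension, per quarter:
# a slice of the cube answering "how do regions compare over time?"
cube = facts.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum", fill_value=0)
print(cube)
```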
42. Extract Transform Load (ETL)
• A process of loading data from a source system into a target system.
• The source system can be a database, a flat file, or an application.
• Similarly, the target system can be a database or some other storage system.
• Represents the main operation through which data warehouses are fed data.
43. ETL Steps
1. The required data is obtained or extracted from the sources.
2. The extracts are modified or transformed by the application of rules.
3. The data is inserted or loaded into the target system.
A runnable sketch of these three steps follows below.
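The three steps map directly onto a tiny Python sketch; the file name, column names, transformation rules and target schema are all assumptions made so the example runs end to end.

```python
import csv
import sqlite3

# Create a tiny flat-file source so the sketch is self-contained.
with open("orders.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["order_id", "customer", "amount_usd", "status"])
    w.writerows([["o1", "Ada", "19.990", "shipped"],
                 ["o2", "Bob", "5.00", "cancelled"]])

# 1. Extract: obtain the required data from the source.
with open("orders.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# 2. Transform: apply rules (drop cancelled orders, normalize amounts).
rows = [(r["order_id"], r["customer"], round(float(r["amount_usd"]), 2))
        for r in raw_rows if r["status"] != "cancelled"]

# 3. Load: insert the transformed data into the target system.
target = sqlite3.connect("warehouse.db")
target.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, customer TEXT, amount REAL)")
with target:
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
print(target.execute("SELECT * FROM orders").fetchall())
```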
44. Data Warehouses
• A central, enterprise-wide repository consisting of historical and current data.
45. Data Warehouses
• Store data pertaining to multiple business entities from different operational systems.
• Heavily used by BI to run various analytical queries.
• Usually interface with an OLAP system to support multi-dimensional analytical queries.
46. Data Warehouses
• Data is periodically extracted, validated, transformed and consolidated into a single denormalized database.
• Over time the amount of data contained will continue to increase, which leads to slower query response times for data analysis tasks.
• This is usually resolved by including optimized databases, called analytical databases, to handle reporting and data analysis tasks.
47. Data Mart
• A subset of the data stored in a data warehouse that typically belongs to a department, division, or specific line of business.
• Data warehouses can have multiple data marts.
48. Data Mart
• Enterprise-wide data is collected and business entities are then extracted.
• Domain-specific entities are persisted into the data warehouse via an ETL process.
49. Traditional BI
• Primarily utilizes descriptive and diagnostic analytics.
• Provides information on historical and current events.
• Only provides answers to correctly formulated questions.
50. Traditional BI
• Correctly formulating questions requires an understanding of business problems and issues and of the data itself.
51. BI Report Types
• BI reports can be categorized, based on different KPIs, as:
• ad-hoc reports
• dashboards
52. Ad-hoc Reports
• A process that involves manually processing data to produce custom-made reports.
• Usually focuses on a specific area of the business, such as its marketing or supply chain management.
• The generated custom reports are detailed and often tabular in nature.
53. Dashboards
• A holistic view of key business areas.
• The information displayed is generated at periodic intervals, in real time or near-real time.
• The presentation of data is graphical in nature, using bar charts, pie charts and gauges.
54. Big Data BI
• Builds upon traditional BI by combining data stored in the warehouse with semi-structured and unstructured data sources.
• Comprises both predictive and prescriptive analytics.
• Facilitates the development of an enterprise-wide understanding of business performance.
55. Big Data BI vs Traditional BI
• Traditional BI analyses generally focus on individual business processes.
• Big Data BI analyses focus on multiple business processes simultaneously.
• This helps reveal patterns and anomalies across a broader scope within the enterprise.
• It leads to data discovery by identifying insights and information that may have been previously absent or unknown.
56. Hybrid Data Warehouse
• A uniform and central repository of structured, semi-structured and unstructured data.
• Provides essential Big Data BI tools with all of the required data.
• Eliminates the need to connect to multiple data sources to retrieve or access data.
• Establishes a standardized data access layer across a range of data sources.
Editor's Notes
How to learn
Emphasize programming with Java
Apology for the document
The document is not quite complete
Some parts are irrelevant
Some were added just because of their interesting nature
Some are missing
Some are not part of this document
Students must lecture on undocumented details
Most commercially relevant data will need to be purchased and may involve the continuation of subscription costs to ensure the delivery of updates to procured datasets.
Even analyzing separate datasets that contain seemingly benign data can reveal private information when the datasets are analyzed jointly.
This can lead to intentional or inadvertent breaches of privacy.
Addressing these privacy concerns requires an understanding of the nature of data being accumulated and relevant data privacy regulations, as well as special techniques for data tagging and anonymization.
For example, telemetry data, such as a car’s GPS log or smart meter data readings, collected over an extended period of time can reveal an individual’s location and behavior.
Some of the components of Big Data solutions lack the robustness of traditional enterprise solution environments when it comes to access control and data security.
Big Data security further involves establishing data access levels for different categories of users.
Helps determine the authenticity and quality of data, and it can be used for auditing purposes.
Maintaining provenance as large volumes of data are acquired, combined and put through multiple processing stages can be a complex task.
At different stages in the analytics lifecycle, data will be in different states because it may be being transmitted, processed or in storage.
As data enters the analytic environment, its provenance record can be initialized with the recording of information that captures the pedigree of the data.
The goal of capturing provenance is to be able to reason over the generated analytic results.
When provenance information is captured on the way to generating analytic results as in Figure 3.3, the results can be more easily trusted and thereby used with confidence.
Provide remote environments that can host IT infrastructure for large-scale storage and processing.
The adoption of a Big Data environment may necessitate that some or all of that environment be hosted within a cloud.
the limits of available computing and storage resources used by an in-house Big Data solution are being reached
Data sources can be internal and/or external to the enterprise, depending on the business scope of the analysis project and the nature of the business problems being addressed.
Performing this stage can become complicated because of differences in:
• Data Structure – Although the data format may be the same, the data model may be different.
• Semantics – A value that is labeled differently in two different datasets may mean the same thing, for example “surname” and “last name”.
An inductive approach that is closely associated with data mining.
The data is explored through analysis to develop an understanding of the cause of the phenomenon.
For example, an online store can be fed processed customer-related analysis results that may impact how it generates product recommendations.
An example is consolidating transportation routes as part of a supply chain process.
For example, alerts may be created to inform users via email or SMS text about an event that requires them to take corrective action.
>>> Strategic layer constrains tactical
>>> Tactical directs operational
>>> alignment captured through metric and PI
>>>> gain insight
Ultimately, this series of enrichment corresponds with the transformation of data into information, information into knowledge and knowledge into wisdom.
E.g., a point of sale system executes business processes in support of corporate operations.
Examples include ticket reservation systems, banking and point of sale systems.
They are relevant to Big Data in that they can serve as both a data source as well as a data sink that is capable of receiving data.
>>>> OLAP is an integral part of BI, data mining & ML.
>>>> can be both data sources and sink
>>>> used in diagnostic, predictive , and prescriptive analytics
>>>> complex queries of multi-dimensional data
>>>> historical data is aggregated and denormalized to be reported
>>>> use multi-dimensional historical data
A Big Data solution encompasses the ETL feature-set for converting data of different types.
Over time this leads to slower query response times for data analysis tasks.
To resolve this shortcoming, data warehouses usually contain optimized databases, called analytical databases, to handle reporting and data analysis tasks.
An analytical database can exist as a separate DBMS, as in the case of an OLAP database.
As previously explained, data warehouses and data marts contain consolidated and validated information about enterprise-wide business entities.
Traditional BI cannot function effectively without data marts because they contain the optimized and segregated data that BI requires for reporting purposes.
Without data marts, data needs to be extracted from the data warehouse via an ETL process on an ad-hoc basis whenever a query needs to be run.
This increases the time and effort to execute queries and generate reports.
Traditional BI uses data warehouses and data marts for reporting and data analysis because they allow complex data analysis queries with multiple joins and aggregations to be issued.
Big Data BI requires the analysis of unstructured, semi-structured and structured data residing in the enterprise data warehouse.
This requires a “next-generation” data warehouse that uses new features and technologies to store cleansed data originating from a variety of sources in a single uniform data format.
>>> Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms.
>>>>> A data lake is a vast pool of raw data, the purpose for which is not yet defined.
>>>>> A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.