Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
This presentation on Big Data will help you understand how Big Data evolved over the years, what Big Data is, applications of Big Data, a case study on Big Data, three important challenges of Big Data, and how Hadoop solved those challenges. The case study covers the Google File System (GFS), where you'll learn how Google solved its problem of storing ever-growing user data in the early 2000s. We'll also look at the history of Hadoop and its ecosystem, with a brief introduction to HDFS, a distributed file system designed to store large volumes of data, and to MapReduce, which allows parallel processing of data. In the end, we'll run through some basic HDFS commands and see how to perform a word count using MapReduce. Now, let us get started and understand Big Data in detail.
The following topics are explained in this Big Data presentation for beginners:
1. Evolution of Big Data
2. Why Big Data?
3. What is Big Data?
4. Challenges of Big Data
5. Hadoop as a solution
6. MapReduce algorithm
7. Demo on HDFS and MapReduce
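Before the demo, the map, shuffle, and reduce phases of a word count can be sketched in plain Python. This is a single-process simulation of what Hadoop distributes across a cluster; the function names here are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In real Hadoop the mappers and reducers run on different nodes and the shuffle moves data over the network, but the per-phase logic is exactly this simple.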
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schemas, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, and how to create, transform, and query DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
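As a taste of the functional style used with Spark RDDs (objectives 10 and 11 above), here is a minimal stand-in written in plain Python. The `MiniRDD` class is invented for illustration and is not the real Spark API; it only shows how `map`, `filter`, and `reduceByKey` transformations chain:

```python
from functools import reduce
from itertools import groupby

class MiniRDD:
    """A tiny stand-in for Spark's RDD: each method returns a new MiniRDD,
    mirroring the immutable transformation chains used in Spark."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, fn):
        return MiniRDD(x for x in self.data if fn(x))

    def reduceByKey(self, fn):
        # Group (key, value) pairs by key, then fold the values per key.
        keyed = sorted(self.data, key=lambda kv: kv[0])
        return MiniRDD(
            (k, reduce(fn, (v for _, v in group)))
            for k, group in groupby(keyed, key=lambda kv: kv[0])
        )

    def collect(self):
        return self.data

sales = MiniRDD([("uk", 10), ("us", 7), ("uk", 5), ("in", 3)])
totals = (sales
          .filter(lambda kv: kv[1] > 4)     # drop small orders
          .reduceByKey(lambda a, b: a + b)  # sum per country
          .collect())
print(totals)  # [('uk', 15), ('us', 7)]
```

Real Spark evaluates such chains lazily and in parallel across a cluster, but the programming model a learner writes against looks just like this.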
Data warehousing and business intelligence project report – sonalighai
Developed a data warehouse project with structured, semi-structured, and unstructured data sources and generated business intelligence reports. The topic of the project was tobacco product consumption in America. The study examined which products are most popular, and found that middle-school students are soft targets for tobacco companies, as most people start using tobacco products at this age.
Tools used: SSMS, SSIS, SSAS, SSRS, R-Studio, Power BI, Excel
Project report on the design and build of a data warehouse from unstructured and structured data sources (Quandl, Yelp, and the UK Office for National Statistics) using SQL Server 2016, MongoDB, and IBM Watson. Design and implementation of business intelligence visualisations using Tableau to answer cross-domain business questions.
DAMA, Oregon Chapter, 2012 presentation - an introduction to Data Vault modeling. I will be covering parts of the methodology, comparing and contrasting issues in general for the EDW space, followed by a brief technical introduction to the Data Vault modeling method.
After the presentation I will be providing a demonstration of the ETL loading layers, LIVE!
You can find more online training at: http://LearnDataVault.com/training
Data visualization is a complex set of processes, an umbrella that covers both information visualization and scientific visualization. Its benefits cannot be ignored: it makes quantities accurate and easy to compare, and it offers valuable guidance on which techniques and tools to use. Scientifically, its effectiveness lies in our brain's ability to maintain a proper balance between perception and cognition through visualization.
Data Virtualization: Introduction and Business Value (UK) – Denodo
Watch full webinar here: https://bit.ly/30mHuYH
Data virtualization started to evolve as the most agile, real-time enterprise data fabric, and it is proving to go beyond its initial promise, becoming one of the most important enterprise big data fabrics. Denodo's vision is to provide a unified data delivery layer as a logical data fabric, bridging the gap between IT and the business, hiding the underlying complexity, and creating a semantic layer that exposes data in a business-friendly manner.
Attend this webinar to learn:
- What data virtualization really is
- How it differs from other enterprise data integration technologies
- Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations
- Business Value of data virtualization and customer use cases
- Highlights of the newly launched Denodo Platform 8.0
Wallchart - Data Warehouse Documentation Roadmap – David Walker
All projects need documentation and many companies provide templates as part of a methodology. This document describes the templates, tools and source documents used by Data Management & Warehousing. It serves two purposes:
• For projects using other methodologies or creating their own set of documents to use as a checklist. This allows the project to ensure that the documentation covers the essential areas for describing the data warehouse.
• To demonstrate our approach to our clients by describing the templates and deliverables that are produced.
Documentation, methodologies and templates are inherently both incomplete and flexible. Projects may wish to add, change, remove or ignore any part of any document. Some may also believe that aspects of one document would sit better in another. If this is the case then users of this document and these templates are encouraged to change them to fit their needs.
Data Management & Warehousing believes that the approach or methodology for building a data warehouse should be to use a series of guides and checklists. This ensures that small teams of relatively skilled resources developing the system can cover all aspects of the project whilst being free to deal with the specific issues of their environment to deliver exceptional solutions, rather than a rigid methodology that ensures that large teams of relatively unskilled staff can meet a minimum standard.
The seminar is about data warehousing. Here we discuss what data warehousing is, a comparison between a database and a data warehouse, different data warehouse models, data marts, and the disadvantages of data warehousing.
Big Data Analytics and its Application in E-Commerce – Uyoyo Edosio
Abstract - This era, unlike any other, is faced with explosive growth in the size of data generated and captured. Data growth has undergone a renaissance, influenced primarily by ever-cheaper computing power and the ubiquity of the internet. This has led to a paradigm shift in the e-commerce sector: data is no longer seen as a byproduct of business activities, but as the sector's biggest asset, providing key insights into the needs of customers, predicting trends in customer behavior, democratizing advertisement to suit consumers' varied tastes, and providing a performance metric to assess effectiveness in meeting customers' needs.
This paper presents an overview of the unique features that differentiate big data from traditional datasets. In addition, the application of big data analytics in e-commerce and the various technologies that make analytics of consumer data possible are discussed.
Further, this paper presents case studies of how leading e-commerce vendors like Amazon.com, Walmart Inc., and Adidas apply big data analytics in their business strategies and activities to improve their competitive advantage. Lastly, we identify some challenges these e-commerce vendors face while implementing big data analytics.
We offer online IT training with placement and project assistance on different platforms, with real-time industry consultants, to provide quality training for IT professionals, corporate clients, and students. A special feature of InformaticaTrainingClasses is extensive training in both Informatica online training and placement. We help you with resume preparation and conduct mock interviews.
Emphasis is given to important topics that are essential and most used in real-time projects. Informatica Training Classes is an online training leader when it comes to high-end, effective, and efficient IT training. We have always focused, and still focus, on providing the most effective and competent training to both students and professionals who are eager to enrich their technical skills.
Training Features at Informatica training classes:
We believe that online training should be measured by three major aspects: quality, content, and the relationship between trainer and student. Beyond the online classes themselves, the material we provide is in tune with the latest IT training standards, so a student need not worry about whether the training imparted is outdated or current.
Course content:
• Basics of data warehousing concepts
• Power center components
• Informatica concepts and overview
• Sources
• Targets
• Transformations
• Advanced Informatica concepts
Please Visit us for the Demo Classes, we have regular batches and weekend batches.
Informatica online training classes
Phone: (404)-900-9988
Email: info@informaticatrainingclasses.com
Web: http://www.informaticatrainingclasses.com
Types of database processing; OLTP vs. data warehouses (OLAP); data warehouse characteristics: subject-oriented, integrated, time-variant, non-volatile; data warehouse functionalities: roll-up (consolidation), drill-down, slicing, dicing, pivot; the KDD process; applications of data mining.
Chapter 4. Data Warehousing and On-Line Analytical Processing – Subrata Kumer Paul
Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791.
Data Warehousing, Data Mining, Data Marts, Data Cube, OLAP Operations, Introduction to Common Messaging System, Web Tier Deployment, Application Servers & Clustered Deployment, IBM Notes and IBM Domino
Data warehousing is a topic in the management of information technology that will help students with their subject matter and serve as a reference for their assigned reports.
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Search and Society: Reimagining Information Access for Radical Futures – Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... – Ramesh Iyer
In today's fast-changing business world, companies must adapt and embrace new ideas to keep up with the competition. However, fostering a culture of innovation takes work: it requires vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
GraphRAG is All You Need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Key Trends Shaping the Future of Infrastructure – Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
This keynote covers the key trends across hardware, cloud, and open source, exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview – Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Smart TV Buyer Insights Survey 2024 – 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk aims to encourage a more independent, flexible, and future-proof approach to using PHP frameworks.
2. Knowledge Management & Business Intelligence
• Transform data into usable information that makes sense
• Support business decisions
• Gain competitive advantage
• Identify and analyze market and user trends
• Identify popular and profitable services and products
• Increase the effectiveness of marketing
• Have the ability and flexibility to quickly create various reports for management and clients
• Have the ability to prioritize product enhancements based on studying customers' behavioural trends
• Detect anomalies (e.g. fraud detection)
• and more…
4. What is a Data Warehouse
• A way to store large amounts of operational data so it can be analyzed to produce comprehensive and intuitive reports
• A tool that gives management the ability to access and analyze information about its business
5. What is a Data Warehouse
• A data warehouse is a copy of transaction data specifically structured for querying and reporting.
• A large collection of integrated, non-volatile, time-variant data from multiple sources, processed for storage in a multi-dimensional model.
Source: Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit”
6. Characteristics of a DW
• Subject-oriented
– Data that gives information about a particular subject instead of about a company's ongoing operations (e.g. CUSTOMER, FINANCIAL INSTITUTION, VENDOR).
• Integrated
– Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole (standardized encoding, consistent units of measure, e.g. one standard way to record a customer's transaction across all systems).
• Non-volatile
– Data, once loaded into the data warehouse, does not change. Each data record represents a distinct state (event).
• Time-variant
– Data is expected to be stored for long durations, with a time stamp recording its state (e.g. sampling, summary, or trend analysis).
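The non-volatile and time-variant characteristics can be illustrated with a tiny append-only store. This is a hypothetical sketch, not a real warehouse engine: state changes are appended as new timestamped rows rather than overwriting old ones, so any historical state can still be reconstructed.

```python
from datetime import date

# Append-only store: a customer's address changes are recorded as new
# timestamped rows, never by overwriting the old row (non-volatile),
# so every historical state can still be queried (time-variant).
warehouse = []

def record_address(customer_id, address, effective):
    warehouse.append({"customer": customer_id,
                      "address": address,
                      "effective": effective})

def address_on(customer_id, as_of):
    """Return the address that was in effect for a customer on a given date."""
    rows = [r for r in warehouse
            if r["customer"] == customer_id and r["effective"] <= as_of]
    return max(rows, key=lambda r: r["effective"])["address"]

record_address(42, "12 Old Street", date(2020, 1, 1))
record_address(42, "7 New Avenue", date(2023, 6, 1))

print(address_on(42, date(2021, 1, 1)))  # 12 Old Street
print(address_on(42, date(2024, 1, 1)))  # 7 New Avenue
```

An OLTP system would typically keep only the current address; the warehouse keeps every state, which is exactly what enables trend analysis.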
8. OLTP vs. Data Warehouse
• OLTP
– Online transaction processing is used at the
routine operation level and supported by
transactional databases optimized for insertion,
updates, deletions and some low level queries.
• Data Warehousing
– Optimized for data retrieval rather than routine
transaction processing; supports decision-
support applications.
9. OLTP vs. Data Warehouse
OLTP                                      | Data Warehouse
Current data                              | Historic data
Detailed data                             | Lightly and highly summarized data
Dynamic                                   | Static
High transaction throughput               | Low transaction throughput
Transaction driven                        | Analysis driven
Serves a large number of users, low volume | Serves a low number of users, large volume
10. Designing a DW
• Top down
– Business Questions – Interview to see what the
business needs to know
• Bottom Up
– What data sources are available and what data is
stored
11. Reminder
• What is a Data Warehouse
– “Large collection of integrated, non-volatile, time
variant data from multiple sources, processed for
storage in a multi-dimensional model”
12. Dimensional Modeling
• Every dimensional model (DM) is composed of
one table with a composite primary key,
called the FACT table, and a set of smaller
tables called DIMENSION tables.
13. Dimensional Modeling
[Cube diagram: Customer, Vendor, and Time dimensions on the x, y, z axes; Payment facts inside the cube]
Notes:
This is a simple 3-dimensional data model
(cube) that stores Payment facts. The x, y, z
axes represent the dimension tables, and
what’s inside the cube represents the
FACT table.
A real dimensional model is never this simple;
this is only a simple visual representation of
what it could look like. In real life it will
require many more dimensions to describe a
business process of a FACT.
14. Dimensional Modeling Rules
• Each DIMENSION table has a simple (non-composite)
primary key that corresponds exactly to one of the
components of the composite key in the FACT table.
• Forms ‘star-like’ structure, which is called a star
schema or star join.
• All natural keys are replaced with surrogate keys. This
means that every join between FACT and DIMENSION
tables is based on surrogate keys, not natural keys.
• Surrogate keys allow data in the warehouse to have
some independence from the data used and produced
by the OLTP systems. (e.g. changing BINs)
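The surrogate-key rules above can be sketched in SQLite, runnable from Python. All table and column names here are hypothetical, chosen to mirror the Payment example used throughout these slides: each dimension has a simple surrogate primary key, and the FACT table's composite primary key is built from exactly those surrogate keys.

```python
import sqlite3

# Minimal star-schema sketch (hypothetical names, not a production design).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id   INTEGER PRIMARY KEY,  -- surrogate key
    customer_num  TEXT,                 -- natural key from the OLTP system
    customer_name TEXT
);
CREATE TABLE dim_time (
    time_id     INTEGER PRIMARY KEY,    -- surrogate key
    date        TEXT,
    day_of_week TEXT
);
CREATE TABLE fact_payment (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    amount      REAL,
    -- composite primary key made of the dimensions' surrogate keys
    PRIMARY KEY (customer_id, time_id)
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print("tables:", tables)
```

Because the fact table joins only on surrogate keys, the OLTP natural keys (`customer_num` here) can change without breaking the warehouse joins.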
15. Denormalizing
• Denormalizing
– DW data schema is denormalized or partially
denormalized to speed data retrieval.
• e.g. in a normalized DB we don’t store information that
can be calculated from stored information; in DW
design we do.
16. Star Schema
Payment FACT: Customer ID, Vendor ID, Time ID, Amount, Response Code

Customer dimension: ID, Customer Num, Customer Name, Customer Age
Vendor dimension: ID, Vendor Num, Vendor Name, Vendor Account Num, Address
Time dimension: ID, Date, Day of Week, Day of Month, Month, Quarter, Year

Note how time is denormalized and stored as a dimension.
17. Why Denormalizing?
• Looking at Time Dimension table:
– We’re storing fields that can be calculated (such as day of week)
• For example, if you are Safeway you want to know on which day of the week
you have the most customers, so you can staff up. The question we ask the DW
would be “show me the average number of transactions we process on each
day of the week”.
– If we weren’t storing the day of week, our DW would have to go
through millions of transactions, calculating the day of week from each
datestamp, in order to match and return the results.
– This calculation is very time consuming and the response time would
be unacceptable.
– We denormalize to reduce the response time by storing more
information than [one could argue is] needed.
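The Safeway-style question above can be shown concretely. This is a toy illustration with an invented schema and invented rows: because `day_of_week` is stored (denormalized) in the time dimension, the warehouse groups by it directly instead of recomputing it from every timestamp.

```python
import sqlite3

# Toy warehouse with a denormalized time dimension (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, date TEXT, day_of_week TEXT);
CREATE TABLE fact_payment (time_id INTEGER, amount REAL);
INSERT INTO dim_time VALUES (1,'2024-01-01','Mon'),(2,'2024-01-02','Tue'),
                            (3,'2024-01-08','Mon');
INSERT INTO fact_payment VALUES (1,10.0),(1,20.0),(2,5.0),(3,7.5);
""")
# "On which day of the week do we process the most transactions?"
rows = conn.execute("""
    SELECT t.day_of_week, COUNT(*) AS transactions
    FROM fact_payment f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.day_of_week
    ORDER BY transactions DESC
""").fetchall()
print(rows)  # in this toy data, Monday has the most transactions
```

Without the stored `day_of_week` column, the same query would need a date-to-weekday calculation on every fact row, which is exactly the response-time cost the slide describes.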
18. Fact Table
• Consists of measured or observed variables, identified via pointers to
the dimension tables.
• Best to store facts that are numerical measurements, continuously valued and
additive (e.g. in a Payment FACT table: amount, CustomerVendorAcct, traceNo,
returnCode, etc.).
• Each measurement is taken at the intersection of all the dimensions.
• Queries are made against the fact table, which links to multiple records from the
various dimension tables to form the result set for the report.
• The fact table is sparse: if there is no value to add, no row is stored.
• Fields in the FACT table should be kept as minimal as possible.
19. Dimension Tables
• Store descriptions of the dimensions of the business.
• Each textual description (attribute) helps to describe a property of
the respective dimension.
• Best to store attributes that are textual, discrete and used as the
source of constraints and row headers in the user’s answer set.
– For attribute that is a numerical measurement, if it varies continuously
every time it is sampled, store it as a fact, otherwise, store as a
dimensional attribute (e.g. standard cost of a product, if it does not
change often, store as a dimensional attribute).
20. FACTless Tables
• Records something that happens, but has nothing to measure
– e.g. To track the Customers that are registered to
use Mobile Banking
• Answers Business questions like: “How many signed up
for this service but never used it?”
• A FACTless table contains only the keys linking
the defined dimension tables
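A minimal sketch of the Mobile Banking example, with hypothetical table names: the FACTless registration table holds only dimension keys, and the "signed up but never used it" question becomes an anti-join on those keys.

```python
import sqlite3

# FACTless fact table sketch: rows carry only dimension keys, no measures.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_mobile_registration (customer_id INTEGER, time_id INTEGER);
CREATE TABLE fact_mobile_usage        (customer_id INTEGER, time_id INTEGER);
INSERT INTO fact_mobile_registration VALUES (1,1),(2,1),(3,2);
INSERT INTO fact_mobile_usage VALUES (1,3);  -- only customer 1 ever used it
""")
# "How many signed up for this service but never used it?"
(never_used,) = conn.execute("""
    SELECT COUNT(*) FROM fact_mobile_registration r
    WHERE NOT EXISTS (SELECT 1 FROM fact_mobile_usage u
                      WHERE u.customer_id = r.customer_id)
""").fetchone()
print(never_used)  # customers 2 and 3 registered but never used the service
```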
21. 9 Steps DW Design Methodology
1. Choosing the process
2. Choosing the grain
3. Identifying and confirming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query
modes
Source: Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit”
22. 9 Steps DW Design Methodology
• Step 1: Choosing Process
– The chosen process (function) refers to the subject matter
of a particular data mart, for example: a Bill Payment
Process
• Step 2: Choosing The Grain
– Decide what a record of the fact table is to represent, i.e.
the grain. For example, the grain is a single Payment
• Step 3: Identifying and conforming the dimensions
– Dimensions set the context for asking questions about the
facts in the fact table. e.g. Who made the Bill Payment
• Step 4: Choosing the Facts
– Facts should be numeric and additive.
23. 9 Steps DW Design Methodology
• Step 5: Storing pre-calculations in the fact table
– Once the facts have been selected, each should be re-examined to determine
whether there are opportunities to use pre-calculations (denormalization)
• Step 6: Rounding out the dimension tables
– Decide what properties to include in each dimension table to best describe it.
They should be intuitive and understandable
• Step 7: Choosing the duration of the database
– How long to keep the data for
• Step 8: Tracking slowly changing dimensions
– Type 1: where a changed dimension attribute is overwritten
– Type 2: where a changed dimension attribute causes a new dimension record
to be created
– Type 3: where a changed dimension attribute causes an alternate attribute to
be created so that both the old and new values of the attribute are
simultaneously accessible in the same dimension record
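The Type 2 behaviour above can be sketched in a few lines. All column names here are assumed for illustration: a changed attribute closes the current dimension row (by setting its end date) and inserts a new row under a new surrogate key, so both old and new states remain queryable.

```python
import sqlite3

# Type 2 slowly-changing-dimension sketch (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_num TEXT,                 -- natural key
    address      TEXT,
    valid_from   TEXT,
    valid_to     TEXT                  -- NULL marks the current row
);
INSERT INTO dim_customer VALUES (1,'C42','12 Oak St','2020-01-01',NULL);
""")
# Customer C42 moves: close the old row, insert a new current row (Type 2).
conn.execute("UPDATE dim_customer SET valid_to='2024-06-01' "
             "WHERE customer_num='C42' AND valid_to IS NULL")
conn.execute("INSERT INTO dim_customer "
             "VALUES (2,'C42','99 Elm Ave','2024-06-01',NULL)")
history = conn.execute(
    "SELECT customer_key, address, valid_to FROM dim_customer "
    "WHERE customer_num='C42' ORDER BY customer_key").fetchall()
print(history)  # both addresses survive, under different surrogate keys
```

Under Type 1, by contrast, the UPDATE would simply overwrite `address` in place and the old value would be lost.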
24. 9 Steps DW Design Methodology
• Step 9: Deciding the query priorities and the
query modes
– Consider physical decision issues
• Indexing for performance, Indexed Views, partitioning,
physical sort order, etc.
• Storage, backup, security
26. ETL
• Extraction, Transformation, Loading
• The task of capturing data by extracting it from
source systems, cleansing (transforming) it,
and finally loading the results into the target system.
• Can be carried out either by separate products or by a
single integrated solution.
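The three ETL stages can be sketched as a toy pipeline. Every function, field, and record here is hypothetical; the point is only the shape: extract raw rows, cleanse and validate them in the transform step, then load the survivors into the target.

```python
# Toy extract-transform-load pipeline (all names and data hypothetical).

def extract():
    # e.g. raw rows pulled from an OLTP source system
    return [{"amount": "10.50", "vendor": " acme "},
            {"amount": "bad",   "vendor": "Globex"}]

def transform(rows):
    # cleanse: trim text, parse numbers, reject rows that fail validation
    clean = []
    for r in rows:
        try:
            clean.append({"amount": float(r["amount"]),
                          "vendor": r["vendor"].strip().title()})
        except ValueError:
            continue  # drop malformed records instead of loading them
    return clean

def load(rows, target):
    # in a real pipeline this would write to the warehouse tables
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # only the valid, cleansed row reaches the target
```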
27. DW – Technology and DBMS
• MySQL
– Scale out, not scale up.
• MySQL supports clustering, replication, etc. You can distribute
the DW across multiple servers
– Fast database engine, especially for bulk inserts and selects
– Lots of open-source tools available for ETL
– MySQL is a cheaper solution, which makes it more attractive for
businesses to make the initial investment