Modern Data Warehouse
Stephen Alex
BI & Big Data Architect

AGENDA
 History and Milestones
 Traditional Data Warehouse
 Key trends breaking the traditional data warehouse
 Modern Data Warehouse
 Massively parallel processing (MPP) architecture
 Hadoop Ecosystem
 Technical Innovation on Hadoop
 Big Data Value Assessment
Rolta AdvizeX Confidential & Proprietary 9/11/2016

History and Milestones
 1970’s: Relational Model Invented
 1984: DB2 released, RDBMS declared mainstream
 1990: RDBMS takes over

The Traditional Data Warehouse
 Central repository for all internal data in a
company.
 Overall relational schema.
 The predictable data structure and quality
optimize processing and reporting.
 Data is stored in disk-block format.
 The fundamental operation is reading a row.
 Indexing via B-trees
 Dynamic row-level locking
 Data transfer usually happens at end of day (EOD).
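
The row-oriented, B-tree-indexed access pattern listed above can be sketched with Python's built-in sqlite3 module, whose indexes are stored as B-trees. The table, column, and index names are hypothetical and the data is made up for illustration; this is a minimal sketch, not a description of any particular warehouse product.

```python
import sqlite3

# Hypothetical table, columns, and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 75.5), ("east", 42.0)],
)

# SQLite stores this index as a B-tree keyed on region.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# The fundamental operation: fetch whole rows that match a key.
rows = conn.execute(
    "SELECT id, region, amount FROM sales WHERE region = ? ORDER BY id",
    ("east",),
).fetchall()

# Ask the query planner how it would locate the matching rows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE region = ?", ("east",)
).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
```

Here `rows` holds the two "east" rows and `plan_text` names the index the planner chose for the lookup.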

Key Trends Breaking The Traditional Data Warehouse

Key Related Business and IT Trends
 Emerging Technologies are disruptive by nature and play a
key role in driving digital business and the related business
trends.
 Business Ecosystems enable each of the business trends,
and organizations are aggressively searching for ways to
leverage the role they play in the business ecosystem.
 Business Moments provide opportunities to capture value
by setting in motion a series of events and actions involving a
network of people, businesses and things that spans or
crosses multiple industries and business ecosystems.
 Digital Economics seeks to harvest value from across the
business ecosystem by identifying business moments of
opportunity and exploiting the economics of connections.
This early-stage trend will have increasing importance as
business models evolve to leverage algorithmic business.
 Algorithmic Business propels organizations to leverage
business algorithms to drive value in the business
ecosystem. In this early-stage trend, we are starting to see
organizations transforming data with algorithms to drive
intelligent actions, particularly with the IoT.

The Risks of Bottlenecks in Data Movement

Hadoop Changes the Game
 Storage and Compute on One Platform

Modern Data Warehouse
 Incorporates Hadoop, traditional data
warehouses, and other data stores.
 Includes multiple repositories that may
reside in different locations.
 Includes data from cloud, mobile
devices, sensors, and the Internet of
Things.
 Includes structured, semi-structured,
unstructured, and raw data.
 Runs on inexpensive commodity hardware in
cluster mode.

Massively parallel processing (MPP) architecture
 Massively parallel processing (MPP)
architecture enables powerful
distributed computing and scale.
 Resources can be added for near-linear
scale-out to the largest data warehousing
projects.
 MPP architecture uses a “shared-nothing”
design: there are multiple physical nodes, each
running its own instance with its own CPU,
memory, and storage. This can make queries
many times faster than on traditional
single-server architectures.
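
The shared-nothing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any specific MPP product works: threads stand in for physical nodes, rows are hash-partitioned by key, each "node" aggregates only its own slice, and a final step merges the partial results. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def node_for(key: str, node_count: int) -> int:
    """Hash-partition a key to a node; crc32 is stable across runs."""
    return crc32(key.encode("utf-8")) % node_count


def partition(rows, node_count):
    """Give each node its own private slice of the data (nothing is shared)."""
    slices = [[] for _ in range(node_count)]
    for key, value in rows:
        slices[node_for(key, node_count)].append((key, value))
    return slices


def local_sum(rows):
    """Work done independently on one node: sum values per key."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def mpp_sum(rows, node_count=4):
    """Partition, aggregate on every node in parallel, then merge."""
    slices = partition(rows, node_count)
    # Threads are only a stand-in for separate physical nodes.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_sum, slices))
    merged = {}
    for partial in partials:
        # A key always hashes to the same node, so partials never overlap.
        merged.update(partial)
    return merged


# Made-up sample rows: (region, amount).
rows = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("north", 1.0)]
totals = mpp_sum(rows)
```

Because every key lives on exactly one node, adding nodes spreads both the data and the work, which is the intuition behind near-linear scale-out.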

Apache Hadoop Ecosystem
 Hadoop ecosystem
components are developed as
Apache Software
Foundation projects.
 The components are
categorized into file
system and data store,
serialization, job
execution, and others, as
shown in the image.

Hadoop / BDD Ecosystem
Technology | Purpose
Hadoop Distributed File System (HDFS) | Distributed file system that provides high-throughput access to application data. Data is split into blocks and distributed across multiple nodes in the cluster.
Hadoop YARN | Framework for job scheduling/monitoring and cluster resource management.
Hive | Facilitates ad hoc queries over data stored in HDFS. Uses HiveQL, a SQL-like language, and provides a relational view of data stored in HDFS.
HCatalog | Built on the Hive Metastore, HCatalog provides a table and storage management layer for Hadoop.
Spark | Powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
Pig | High-level platform for creating MapReduce programs. BDD uses Pig to manipulate data prior to ingesting via data processing.
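
The HDFS behaviour described in the table (data split into blocks and distributed across nodes) can be sketched as follows. This is a simplified illustration, not the HDFS implementation: the round-robin placement, the tiny block size, the node names, and both function names are assumptions. Replication is not mentioned in the slides; it is included here because HDFS keeps multiple copies of each block, three by default, and real placement is rack-aware.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(block_count: int, nodes, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in: real HDFS placement also considers racks and load.
    """
    copies = min(replication, len(nodes))
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(copies)]
        for block in range(block_count)
    }


# Toy values: a 10-byte "file" with a 4-byte block size.
data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(
    len(blocks), ["node1", "node2", "node3", "node4"], replication=3
)
```

Losing any single node in this sketch still leaves at least two copies of every block, which is why a cluster of commodity machines can tolerate hardware failures.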

Hadoop / BDD Ecosystem (continued)
Technology | Purpose
Oozie | Workflow scheduler system to manage Apache Hadoop jobs. BDD uses Oozie for workflow management (sampling, profiling, enrichment).
Sqoop | Tool for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
Flume | Tool for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.
ZooKeeper | Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Hue | Set of web applications that enable you to interact with a CDH cluster.
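
A workflow scheduler such as Oozie runs jobs in dependency order. The sketch below uses Python's standard graphlib module to show that idea; it is not Oozie's API or configuration format. The step names come from the table, but the dependencies between them are an assumption made for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish first.
# The ordering sampling -> profiling -> enrichment is assumed, not sourced.
workflow = {
    "sampling": set(),
    "profiling": {"sampling"},
    "enrichment": {"profiling"},
}

# static_order() yields every step only after all of its prerequisites.
order = list(TopologicalSorter(workflow).static_order())
```

A real scheduler adds retries, timing, and monitoring on top of this ordering step.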

Oracle BDD Technical Innovation on Hadoop
Key Features and Functionality:
Find
• Access a rich, interactive catalog of all data in Hadoop
• Use familiar search and guided navigation to find information quickly
• See data set summaries, user annotations and recommendations
• Provision personal and enterprise data to Hadoop via self-service
Explore
• Visualize all attributes by type
• Sort attributes by information potential
• Assess attribute statistics, data quality and outliers
• Use a scratch pad to uncover correlations between attributes
Transform
• Get the data ready for analytics via intuitive, user-driven data wrangling
• Leverage an extensive library of data transformations and enrichments
• Preview results, undo, commit and replay transforms
• Test on sample data in memory then apply to full data set in Hadoop
Discover
• Join and blend data for deeper perspectives
• Compose project pages via drag and drop
• Use powerful search and guided navigation to ask questions
• See new patterns in rich, interactive data visualizations
Share
• Share projects, bookmarks and snapshots with others
• Build galleries and tell Big Data stories
• Collaborate and iterate as a team
• Publish blended data to HDFS for leverage in other tools

Components of Big Data Discovery

Big Data Value Assessment
Descriptive analytics looks at past performance and explains it by mining
historical data for the reasons behind past success or failure. This is
traditional BI work.
Predictive analytics answers the question of what will happen. Historical
performance data is combined with rules, algorithms, and external data to
determine the probable future outcome of an event or the likelihood of a
situation occurring.
Prescriptive analytics not only anticipates what will happen and when it will
happen, but also why it will happen, and suggests actions to take in response.
[Figure: analytics levels ranging from Basic Analytics to Advanced Analytics: Descriptive, Predictive, Prescriptive]
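
The three levels of analytics can be contrasted on a toy example. This is a minimal sketch with made-up numbers: the sales figures, the capacity threshold, and the recommended actions are all hypothetical, and a least-squares trend line stands in for whatever predictive model would be used in practice.

```python
from statistics import mean

# Made-up sales per period, for illustration only.
sales = [100.0, 110.0, 120.0, 130.0]
periods = list(range(len(sales)))

# Descriptive: summarize what already happened.
average = mean(sales)

# Predictive: fit a least-squares trend line and extend it one period.
x_bar, y_bar = mean(periods), mean(sales)
slope = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(periods, sales)
) / sum((x - x_bar) ** 2 for x in periods)
intercept = y_bar - slope * x_bar
forecast = intercept + slope * len(sales)

# Prescriptive: turn the forecast into a recommended action.
capacity = 135.0  # hypothetical threshold
action = "increase stock" if forecast > capacity else "hold"
```

Each level builds on the one before it: the summary describes the past, the trend projects it forward, and the rule converts the projection into a decision.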

Thank You!!!
Stephen Alex
BI & Big Data Architect
(732) 485-0011(m)
