Modern Data Warehouse
Stephen Alex
BI & Big Data Architect
AGENDA
• History and Milestones
• Traditional Data Warehouse
• Key trends breaking the traditional data warehouse
• Modern Data Warehouse
• Massively parallel processing (MPP) architecture
• Hadoop Ecosystem
• Technical Innovation on Hadoop
• Big Data Value Assessment
Rolta AdvizeX Confidential & Proprietary 9/11/2016
History and Milestones
 1970’s: Relational Model Invented
 1984: DB2 released, RDBMS declared mainstream
 1990: RDBMS takes over
The Traditional Data Warehouse
• Central repository for all internal data in a company.
• Overall relational schema.
• The predictable data structure and quality optimize processing and reporting.
• Data is stored in disk-block format.
• The fundamental operation is reading a row.
• Indexing via B-trees.
• Dynamic row-level locking.
• Data transfer is usually end of day (EOD).
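The row-oriented, B-tree-indexed access pattern listed above can be sketched with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in for a relational engine; the table, column, and index names are hypothetical, and the sketch makes no claim about how any particular warehouse product is implemented.

```python
import sqlite3

# Hypothetical table, column, and index names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)
# SQLite stores ordinary indexes as B-trees.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")


def query_plan(sql, params=()):
    """Return the human-readable steps SQLite plans to run for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


sql = "SELECT * FROM orders WHERE customer_id = ?"
plan = query_plan(sql, (42,))
rows = conn.execute(sql, (42,)).fetchall()

print(plan)  # the plan text names the index used for the lookup
print(len(rows), "whole rows read for customer 42")
```

The lookup walks the index to find matching row ids and then reads each matching row in full, which is the "read a row" operation the slide refers to.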
Key Trends Breaking The Traditional Data Warehouse
Key Related Business and IT Trends
• Emerging Technologies are disruptive by nature and play a key role in driving digital business and the related business trends.
• Business Ecosystems enable each of the business trends, and organizations are aggressively searching for ways to leverage the role they play in the business ecosystem.
• Business Moments provide opportunities to capture value by setting in motion a series of events and actions involving a network of people, businesses and things that spans or crosses multiple industries and business ecosystems.
• Digital Economics seeks to harvest value from across the business ecosystem by identifying business moments of opportunity and exploiting the economics of connections. This early-stage trend will have increasing importance as business models evolve to leverage algorithmic business.
• Algorithmic Business propels organizations to leverage business algorithms to drive value in the business ecosystem. In this early-stage trend, we are starting to see organizations transforming data with algorithms to drive intelligent actions, particularly with the IoT.
The Risks of Bottlenecks in Data Movement
Hadoop Changes the Game
• Storage and Compute on One Platform
Modern Data Warehouse
• Incorporates Hadoop, traditional data warehouses, and other data stores.
• Includes multiple repositories that may reside in different locations.
• Includes data from cloud, mobile devices, sensors, and the Internet of Things.
• Includes structured, semi-structured, and unstructured raw data.
• Runs on inexpensive commodity hardware in cluster mode.
Massively parallel processing (MPP) architecture
• Massively parallel processing (MPP) architecture enables extremely powerful distributed computing and scale.
• Resources can be added for near-linear scale-out to the largest data warehousing projects.
• MPP architecture uses a “shared-nothing” design: there are multiple physical nodes, each running its own instance with its own CPU, memory, and storage. Each node works on its own slice of the data in parallel, which can make performance many times faster than traditional single-server architectures.
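The shared-nothing idea can be sketched in a few lines of Python: rows are hash-partitioned so that each key lives on exactly one node, every node aggregates only its own partition, and a coordinator merges the partial results. This is a single-process simulation under assumed names (`node_for`, `distribute`, `mpp_sum_by_region`) and made-up sales rows, not the implementation of any MPP product; in a real cluster the per-node step would run concurrently on separate machines.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Hash-partition: a given key always maps to the same single node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes):
    """Place each (region, amount) row on the node that owns its key."""
    nodes = [[] for _ in range(num_nodes)]
    for region, amount in rows:
        nodes[node_for(region, num_nodes)].append((region, amount))
    return nodes


def local_aggregate(partition):
    """Work done by one node, using only the rows it stores locally."""
    totals = defaultdict(float)
    for region, amount in partition:
        totals[region] += amount
    return dict(totals)


def mpp_sum_by_region(rows, num_nodes=4):
    """Coordinator: collect per-node partial results and merge them."""
    partials = [local_aggregate(p) for p in distribute(rows, num_nodes)]
    merged = {}
    for partial in partials:
        # Keys never overlap across nodes because of the hash partitioning.
        merged.update(partial)
    return merged


# Made-up sample rows for illustration.
sales = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("north", 7.0), ("west", 1.0)]
result = mpp_sum_by_region(sales)
print(result)
```

Because no node needs another node's data to finish its share of the work, adding nodes spreads the rows more thinly, which is the intuition behind the near-linear scale-out claim.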
Apache Hadoop Ecosystem
• The Hadoop ecosystem components are Apache Software Foundation projects.
• The components are categorized into file system and data store, serialization, job execution, and others, as shown in the image.
Hadoop / BDD Ecosystem
Technology | Purpose
Hadoop Distributed File System (HDFS) | Distributed file system that provides high-throughput access to application data. Data is split into blocks and distributed across multiple nodes in the cluster.
Hadoop YARN | Framework for job scheduling/monitoring and cluster resource management.
Hive | Facilitates ad hoc queries over data stored in HDFS. Uses HiveQL, which is a SQL-like language. Provides a relational view of data stored in HDFS.
HCatalog | HCatalog (built on the Hive Metastore) provides a table and storage management layer for Hadoop.
Spark | Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
Pig | Pig is a high-level platform for creating MapReduce programs. BDD uses Pig to manipulate data prior to ingesting via data processing.
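The HDFS entry above says that data is split into blocks and distributed across nodes. A minimal sketch of that idea follows, assuming the common defaults of a 128 MB block size and a replication factor of 3. The function name `plan_blocks`, the node names, and the round-robin placement are illustrative assumptions; real HDFS placement is decided by the NameNode and is rack-aware.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual default in Hadoop 2.x and later
REPLICATION = 3                 # usual default HDFS replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, nodes_holding_a_replica).

    Assumes len(nodes) >= replication so that replicas land on distinct nodes.
    """
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    plan = []
    for i in range(num_blocks):
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - i * block_size)
        # Simplified round-robin placement, not the real HDFS policy.
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


# Hypothetical four-node cluster and a 300 MB file.
nodes = ["node1", "node2", "node3", "node4"]
plan = plan_blocks(300 * 1024 * 1024, nodes)
for index, length, replicas in plan:
    print(index, length // (1024 * 1024), "MB", replicas)
```

Since each block exists on several nodes, a job can usually be scheduled on a machine that already holds the data it needs, which is what keeping storage and compute on one platform buys.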
Hadoop / BDD Ecosystem
Technology | Purpose
Oozie | Oozie is the workflow scheduler system to manage Apache Hadoop jobs. BDD uses Oozie for workflow management (sampling, profiling, enrichment).
Sqoop | Tool for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
Flume | Tool for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.
ZooKeeper | ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Hue | Hue is a set of web applications that enable you to interact with a CDH cluster.
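Oozie workflows are directed acyclic graphs of actions, so the scheduling idea can be sketched with Python's standard graphlib module (Python 3.9+). The step names come from the table above, but the dependency order between them and the `run_workflow` helper are assumptions made for illustration; this is not Oozie's API or BDD's actual workflow definition.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step maps to the set of steps that must finish before it starts.
# The ordering sampling -> profiling -> enrichment is assumed for illustration.
workflow = {
    "sampling": set(),
    "profiling": {"sampling"},
    "enrichment": {"profiling"},
}


def run_workflow(dag, actions):
    """Run each action only after all of its predecessors have completed."""
    completed = []
    for step in TopologicalSorter(dag).static_order():
        actions[step]()
        completed.append(step)
    return completed


# Placeholder actions; a real scheduler would launch Hadoop jobs here.
actions = {name: (lambda n=name: print("running", n)) for name in workflow}
order = run_workflow(workflow, actions)
```

`TopologicalSorter` raises `graphlib.CycleError` if the steps depend on each other in a loop, which mirrors the requirement that a workflow definition be acyclic.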
Top Three Hadoop Vendors
Oracle BDD Technical Innovation on Hadoop
Key Features and Functionality:
Find
• Access a rich, interactive catalog of all data in Hadoop
• Use familiar search and guided navigation to find information quickly
• See data set summaries, user annotation and recommendations
• Provision personal and enterprise data to Hadoop via self-service
Explore
• Visualize all attributes by type
• Sort attributes by information potential
• Assess attribute statistics, data quality and outliers
• Use a scratch pad to uncover correlations between attributes
Transform
• Get the data ready for analytics via intuitive, user-driven data wrangling
• Leverage an extensive library of data transformations and enrichments
• Preview results, undo, commit and replay transforms
• Test on sample data in memory then apply to full data set in Hadoop
Discover
• Join and blend data for deeper perspectives
• Compose project pages via drag and drop
• Use powerful search and guided navigation to ask questions
• See new patterns in rich, interactive data visualizations
Share
• Share projects, bookmarks and snapshots with others
• Build galleries and tell Big Data stories
• Collaborate and iterate as a team
• Publish blended data to HDFS for leverage in other tools
Components of Big Data Discovery
Big Data Value Assessment
Descriptive analytics looks at past performance and explains it by mining historical data for the reasons behind past success or failure. This is traditional BI work.
Predictive analytics answers the question “What will happen?” Historical performance data is combined with rules, algorithms, and external data to determine the probable future outcome of an event or the likelihood of a situation occurring.
Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen, and it suggests actions to take in response.
[Figure labels: Basic Analytics, Advanced Analytics; Descriptive, Predictive, Prescriptive]
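The three levels of analytics can be contrasted on one tiny data set. The sketch below uses made-up monthly sales figures, a least-squares trend line as the predictive step, and a deliberately simple reorder rule as the prescriptive step. All function names and the rule itself are hypothetical, chosen only to show how each level builds on the one before.

```python
from statistics import mean

# Made-up monthly sales figures, for illustration only.
monthly_sales = [100, 110, 120, 130]


def describe(history):
    """Descriptive: summarize what already happened."""
    return {"mean": mean(history), "min": min(history), "max": max(history)}


def predict_next(history):
    """Predictive: fit a least-squares trend line and extend it one period."""
    n = len(history)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(history)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe(forecast, stock_on_hand):
    """Prescriptive: turn the forecast into a recommended action.

    Hypothetical rule: order enough units to cover the forecast demand.
    """
    return max(0, forecast - stock_on_hand)


summary = describe(monthly_sales)
forecast = predict_next(monthly_sales)
order_quantity = prescribe(forecast, stock_on_hand=90)
print(summary, forecast, order_quantity)
```

The descriptive step only reports on history, the predictive step produces a number about the future, and the prescriptive step converts that number into a decision.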
Thank You!!!
Stephen Alex
BI & Big Data Architect
(732) 485-0011(m)
