Intro to Hybrid Data Warehouse
Presented by: Jonathan Bloom
Senior BI Consultant, Agile Bay, Inc.
Jonathan Bloom
Current Position:
Senior BI Consultant
Customers & Partners
Blog: http://www.BloomConsultingBI.com
Twitter: @SQLJon
Linked-in: http://www.linkedin.com/BloomConsultingBI
Email: JBloom@agilebay.com
www.agilebay.com
Agenda
EDW
Hybrid Data Warehouse
Hadoop
Q&A
Why EDW?
Convert Data to Information
 Accumulating Data
 Manage the Business
 OLTP != Reporting
 Apply Business Rules
 Clean Data
 Analytics
 Proven Framework
EDW Role
 Reporting Lifecycle
 Domain Knowledge
 Interact with Business
 Gather Specs
 Estimate Time
 Knowledge of Database
 SQL Skills
 Change Management
EDW Architecture
 Source System
 Staging
 Raw
 Master Data Services
 Enterprise Data Warehouse
 Analysis Services Cubes
Data Modeling
 Kimball Methodology
 Star Schema
 Pattern forms a graphical “Star”
 Snowflake Schema
 Branches
Tables
 Dimension Tables
 Describe Data
 Fact Tables
 Measures (Sums, Counts, Max, Min, etc.)
 Contain Surrogate Keys
 Link back to Dim Tables
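 As a rough sketch of that layout (the table and column names here are illustrative, not from the slides), a dimension and fact pair might look like:

  -- Dimension table: describes the data, one row per customer
  CREATE TABLE DimCustomer (
      CustomerKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
      CustomerCode  VARCHAR(20),                    -- natural/business key
      CustomerName  VARCHAR(100),
      City          VARCHAR(50)
  );

  -- Fact table: measures plus surrogate keys linking back to the dims
  CREATE TABLE FactSales (
      DateKey      INT NOT NULL,        -- links to the date dimension
      CustomerKey  INT NOT NULL
          REFERENCES DimCustomer (CustomerKey),
      SalesAmount  DECIMAL(18,2),       -- summed / counted in the cube
      Quantity     INT
  );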
Slowly Changing Dimensions
 Type 0 method is passive
 Values remain as they were at the time the dimension
record was first inserted
 Type 1
 Overwrites old with new data
 Does not track historical data
 Type 2
 Tracks historical data by creating multiple records
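 A minimal Type 2 sketch, assuming the dimension also carries hypothetical EffectiveDate / EndDate / IsCurrent columns (not shown on the slide):

  -- Type 2: expire the current row, then insert the new version
  UPDATE DimCustomer
  SET    EndDate = GETDATE(), IsCurrent = 0
  WHERE  CustomerCode = 'C1001' AND IsCurrent = 1;

  INSERT INTO DimCustomer
      (CustomerCode, CustomerName, City, EffectiveDate, IsCurrent)
  VALUES
      ('C1001', 'Acme Corp', 'Tampa', GETDATE(), 1);

  -- Type 1, by contrast, overwrites in place and keeps no history:
  -- UPDATE DimCustomer SET City = 'Tampa' WHERE CustomerCode = 'C1001';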
Date Dimensions
 Create Scripts
 Fiscal Year
 Custom Start / End Dates
 Key Example: 20140226
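 A small sketch of how that smart key and a few common attributes can be derived from the date itself (column names are illustrative):

  DECLARE @d DATE = '2014-02-26';
  SELECT
      CONVERT(INT, CONVERT(CHAR(8), @d, 112)) AS DateKey,        -- 20140226
      YEAR(@d)                                AS CalendarYear,
      DATEPART(QUARTER, @d)                   AS CalendarQuarter,
      DATENAME(WEEKDAY, @d)                   AS DayName;
  -- A fiscal year with a custom start date (say July 1) can be derived
  -- the same way, by shifting the month before computing the year.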
Dim Tables
Fact Tables
Fact Table Keys
Analysis Server
Cubes (Measure Groups)
Cubes (Dimensions)
SSAS
 Create Connections
 Add Data Sources
 Create Relationships
 Add Dimensions
 Create Measure Groups / Measures
 Create Calculated Measures
 Create Hierarchies
Integrate Hadoop with EDW
Hadoop
 Open Source Community
 Distributed Parallel Processing
 Commodity hardware
 Large Data sets
 Semi-structured / Unstructured Data
Data Gosinta (goes into)
 When thinking about Hadoop, we think of data: how to get it into
HDFS and how to get it back out. Luckily, Hadoop has some popular
processes to accomplish this.
SQOOP
 SQOOP was created to move data back and forth easily between an
External Database or flat file and HDFS or HIVE. There are standard
commands for moving data by Importing and Exporting. When data is
moved into HDFS, Sqoop creates files in the HDFS folder system, and
those folders can be partitioned in a variety of ways. Data can be
appended to the files through SQOOP jobs, and you can add a WHERE
clause to pull just certain data; for example, bring in only
yesterday's rows and run the SQOOP job daily to populate Hadoop.
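 A hedged command-line sketch of such a daily import (the server, credentials, paths, and table are made up for illustration):

  # Pull only yesterday's rows from the source database into HDFS,
  # appending to the existing files; scheduled to run once a day.
  sqoop import \
    --connect "jdbc:sqlserver://dbserver:1433;databaseName=SalesDW" \
    --username etl_user --password-file /user/etl/sqlserver.pw \
    --table Orders \
    --where "OrderDate >= DATEADD(day, -1, CAST(GETDATE() AS date))" \
    --target-dir /staging/orders \
    --append \
    --num-mappers 4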
Hive
 Once data gets moved into Hadoop HDFS, you can add a layer of
HIVE on top, which structures the data into a relational
format. Once applied, the data can be queried with HIVE SQL. When
creating a table in the HIVE database schema, you can create an
External table, which is basically a metadata pass-through layer
that points to the actual data; if you drop the External table, the
data remains intact.
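 A hedged HiveQL sketch, assuming comma-delimited files already landed in a hypothetical /staging/orders folder (for example via the Sqoop job above):

  -- External table: Hive stores only the metadata; the files stay in
  -- HDFS, so dropping the table leaves the data intact.
  CREATE EXTERNAL TABLE orders_ext (
      order_id     INT,
      customer_id  INT,
      order_date   STRING,
      amount       DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/staging/orders';

  -- Once defined, the data can be queried with Hive SQL:
  SELECT order_date, SUM(amount) FROM orders_ext GROUP BY order_date;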
ODBC
 From HIVE SQL, the tables are exposed through ODBC so the data can
be accessed by Reports, Databases, ETL, etc.
So, as you can see from the basic description above, you can move
data back and forth easily between Hadoop and your Relational
Database (or flat files).
PIG
 In addition, you can use a Hadoop language called PIG (not making
this up) to massage the data through a structured series of steps, a
form of ETL if you will.
Hybrid Data Warehouse
 You can keep the data up to date by using SQOOP, then add data
from a variety of systems to build a Hybrid Data Warehouse. Data
Warehousing is a concept, a documented framework to follow with
guidelines and rules, and storing the data across both Hadoop and
Relational Databases is typically known as a Hybrid Data Warehouse.
Connect to the Data
 Once data is stored in the HDW, it can be consumed by users via
HIVE ODBC with Microsoft Power BI, Tableau, QlikView, SAP HANA, or a
variety of other tools sitting on top of the data layer, including
Self Service tools.
Machine Learning
 In addition, you could apply MAHOUT Machine Learning algorithms to
your Hadoop cluster for Clustering, Classification and Collaborative
Filtering. And you can run statistical analysis with the R language,
for example Revolution Analytics' R distribution for Hadoop.
Streaming
 And you can receive Streaming Data.
Monitor
 There's Zookeeper, which is a centralized service for coordinating
configuration and state across the cluster.
Graph
 And Giraph, which gives Hadoop the ability to process Graph
connections between nodes.
In Memory
 And Spark, which allows faster processing by bypassing Map Reduce
and running In Memory.
Cloud
 You can run your Hybrid Data Warehouse in the Cloud with Microsoft
Azure (Blob Storage and HDInsight) or Amazon Web Services.
On Premise
 You can run On Premise with IBM InfoSphere BigInsights, Cloudera,
Hortonworks and MapR.
Hadoop 2.0
 And with the latest Hadoop 2.0, there's the addition of YARN, a new
layer that sits between HDFS2 and the application layers. Although
Map Reduce was originally designed as the sole, batch-oriented
approach to getting data from HDFS, it's no longer the only way. HIVE
SQL has been sped up by Impala, which completely bypasses Map Reduce,
and by the Stinger initiative, which sits atop Tez. Tez, paired with
compressed column-store formats such as ORC, speeds up the
interaction.
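 A hedged HiveQL sketch of those two levers, the execution engine and a compressed column-store format (table names are illustrative):

  -- Run Hive on Tez instead of classic MapReduce (where Tez is installed)
  SET hive.execution.engine=tez;

  -- Store the data as ORC, a compressed columnar format, to speed up scans
  CREATE TABLE orders_orc STORED AS ORC
  AS SELECT * FROM orders_ext;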
New Features
 With Hadoop 2.0, you can now monitor your clusters with Ambari,
which has an API layer for 3rd party tools to hook into. A well-known
limitation of Hadoop has been Security, which has now been addressed
as well.
HBase
 HBase is a separate database that allows random read/write access
to the HDFS data, and it too sits on the HDFS cluster. Data can be
ingested into HBase and interpreted On Read (schema-on-read), which
Relational Databases do not offer.
HCatalog
 Sometimes when developing, users don't know where data is stored,
and sometimes the data can be stored in a variety of formats, because
HIVE, PIG and Map Reduce can each have separate data model types.
HCatalog was created to alleviate some of that frustration: it's a
table abstraction layer, a metadata service and a shared schema for
Pig, Hive and M/R, and it exposes information about the data to
applications.
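 A small sketch of the idea: the table is defined once in the shared metastore, and HCatalog lets the other engines find it by name instead of by file path and format (the table name is illustrative):

  -- Defined once; HCatalog exposes the same schema and location to Pig
  -- (via HCatLoader) and to MapReduce (via HCatInputFormat), so those
  -- jobs reference "sales_by_day" rather than an HDFS path and format.
  CREATE TABLE sales_by_day (
      order_date  STRING,
      total       DOUBLE
  )
  STORED AS ORC;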
Hadoop
Future
 OLTP?
 Artificial Intelligence
 Neural Networks
 Robots
Summary
EDW is a concept / framework
Ingest Data
ETL
Output / Reports / Analytics
Stay Current
Never stop learning!
Blog: www.BloomConsultingBI.com
Twitter: @SQLJon
Linked-in: www.linkedin.com/BloomConsultingBI
Email: JBloom@agilebay.com
