USE BIG DATA TECHNOLOGIES TO
MODERNIZE YOUR ENTERPRISE
DATA WAREHOUSE
                  Most organizations’ enterprise data warehouses were built with online transaction
                  processing (OLTP)-centric technologies and architectures that are 15-20 years old.
                  Over the years, more data has been bolted on to these systems, and the query
                  load being driven by both traditional and mobile business intelligence products has
                  increased exponentially, resulting in brittle, over-burdened, costly data warehouses
                  that can take hours to return results. They don’t meet the growing data appetite of
                  the business, and don’t answer the questions needed to run the business at the
                  required levels of granularity, or at the necessary speed. Yet too much has been
                  invested in them to simply throw them out.

                  Big Data market dynamics have resulted in the creation of new technologies,
                  products, and approaches that can be used to modernize these stodgy, inflexible
                  data warehouses, and make them more responsive to the business—without
                  throwing out what is already in place. This paper describes four tactics that can be
                  implemented quickly, using an organization’s existing skill sets, and that can
                  rapidly show a return on investment.

EMC PERSPECTIVE
TACTIC #1: ACCELERATE YOUR DATA WAREHOUSE WITH MPP-BASED ARCHITECTURES
BENEFITS
Leverage more detailed, more robust dimensional data
 •   Seasonality to forecast retail sales and energy consumption
 •   Localization to pinpoint lending or fraud exposure
 •   Hyper-dimensionality for digital media attribution or health care treatment analysis

Massively Parallel Processing (MPP)-based databases provide a cost-effective, scale-out
data warehouse environment that allows organizations to leverage Moore’s Law [1] on
performance-to-cost ratio improvements in x86 processors. MPP databases provide a
non-intrusive analytical platform/data warehouse for data discovery and exploratory
work on massive amounts of data. Built on inexpensive commodity clusters, MPP
databases can extend, complement, or replace parts of your existing data warehouse,
managing massive volumes of detailed data, while providing agile query, reporting,
dashboards, and analytics (see Figure 1).

MPP databases, while offering many of the same benefits as your existing data
warehouse, also provide the following advantages:

 •   Extreme scalability on general purpose systems
 •   Automatic parallelization
 •   Ability to load and query like any other database
 •   Scanning and processing on all nodes in parallel
 •   Extreme scalability and optimized I/O
 •   Linear scalability to easily add nodes and storage
 •   Improved query and loading performance




                                                      	
  

Figure 1: MPP Data Warehouse Architectures Scale Easily to Speed Results and Process More Data




[1] Moore's law is the observation that, over the history of computing hardware, the number of
transistors on integrated circuits doubles approximately every two years. The result is the doubling
of computing power at the same cost every 18 to 24 months. http://en.wikipedia.org/wiki/Moore%27s_law
An MPP data warehouse will enable more granular data for query, reporting, and
dashboard drill-down and drill-across exploration. Analysis can be performed on
detailed data instead of data aggregates.

On the analytics side, once a model has been developed and business insights have
been gleaned from these data sets, simply migrate the model and/or the insights into
the existing data warehouse for integration into the current business intelligence
environment. Alternatively, analytic modeling can also be done on the MPP platform,
making it part of the production process.
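
The scatter-and-gather pattern behind parallel scanning in an MPP database can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the names `distribute`, `scan_segment`, and `parallel_total_by_store`, the `store_id` distribution key, and the four-segment layout are all hypothetical, and threads stand in for the separate nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical number of worker nodes ("segments")


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Hash-distribute rows across segments by a distribution key (store_id)."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[hash(row["store_id"]) % num_segments].append(row)
    return segments


def scan_segment(segment_rows):
    """Each segment scans only its own rows and returns partial sums per store."""
    partial = defaultdict(float)
    for row in segment_rows:
        partial[row["store_id"]] += row["amount"]
    return partial


def parallel_total_by_store(rows, num_segments=NUM_SEGMENTS):
    """Coordinator: scatter the scan to every segment, then merge the partials."""
    segments = distribute(rows, num_segments)
    # Threads are only a stand-in here; a real MPP system runs on separate nodes.
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(scan_segment, segments))
    totals = defaultdict(float)
    for partial in partials:
        for store_id, amount in partial.items():
            totals[store_id] += amount
    return dict(totals)


sales = [
    {"store_id": 1, "amount": 10.0},
    {"store_id": 2, "amount": 5.0},
    {"store_id": 1, "amount": 2.5},
    {"store_id": 3, "amount": 7.0},
]
totals = parallel_total_by_store(sales)
```

Because each segment holds only a share of the rows, adding segments shrinks the work per segment, which is the intuition behind the scalability claims above.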



TACTIC #2: STOP MOVING DATA TO THE ANALYTICS; BRING THE ANALYTICS TO THE DATA
BENEFITS
Leverage low-latency (high-velocity) data access
 •   Drive real-time customer acquisition, predictive maintenance, or network
     optimization decisions
 •   Update analytic models on demand based upon current market or local weather
     conditions

One of the most dramatic developments in Big Data is the advent of in-database
analytics. In-database analytics addresses one of the biggest shortcomings in performing
advanced analytics: the requirement to move large amounts of data around. That
requirement has forced many organizations and data scientists to settle for working with
aggregate tables, because the data transfer issue is so debilitating to the analytic
exploration and discovery process. In-database analytics reverses the process by
moving the analytic algorithms to where the data is stored, accelerating the
development and deployment of models. Eliminating data movement results in
substantial benefits:

 •   Moving a few terabytes can take hours. With in-database analytics, that transfer
     time drops to zero.

 •   Because the movement of data is the most time-consuming activity in logical
     processing time, reducing data movement reduces the processing time to roughly
     1/N of its original value, where N is the number of processing units. Processing
     time for 1 TB can be reduced by a factor of 16 with only a five-processor system,
     going from 193 minutes to 12 minutes (see Figure 2).
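
The difference between moving data to the analytics and moving the analytics to the data can be illustrated with a small sketch. It assumes Python's built-in `sqlite3` module as a stand-in for the warehouse; the `readings` table and both function names are hypothetical, and a real in-database analytics platform would run far richer algorithms than an average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the data warehouse
conn.execute("CREATE TABLE readings (region TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("east", 10.0), ("east", 14.0), ("west", 3.0), ("west", 5.0), ("west", 7.0)],
)


def mean_by_region_client_side(conn):
    """Move every row to the client, then compute there (the pattern to avoid)."""
    sums, counts = {}, {}
    for region, value in conn.execute("SELECT region, value FROM readings"):
        sums[region] = sums.get(region, 0.0) + value
        counts[region] = counts.get(region, 0) + 1
    return {region: sums[region] / counts[region] for region in sums}


def mean_by_region_in_database(conn):
    """Push the computation to the data; only one row per region leaves the database."""
    query = "SELECT region, AVG(value) FROM readings GROUP BY region"
    return dict(conn.execute(query).fetchall())
```

Both functions return the same answer, but the first transfers every row while the second transfers one row per region. At terabyte scale that transfer is the cost this tactic aims to remove.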




                               	
  

                               	
  

Figure 2: In-database Analytics Dramatically Speeds Processing Time
TACTIC #3: USE ALL OF YOUR DATA WITH A NEXT GENERATION OPERATIONAL DATA STORE

BENEFITS
Manage a wide variety of structured and unstructured data sources
 •   Integrate unstructured claims descriptions to reduce fraudulent claims
 •   Leverage mobile data to create real-time promotions
 •   Leverage sensor readings to optimize yield and pricing

The Hadoop Distributed File System (HDFS) provides a powerful yet inexpensive
option for modernizing operational data store (ODS) and data staging areas. HDFS
is a cost-effective, large-scale storage system, and Hadoop pairs it with an intrinsic
computing and analytical capability (MapReduce). Built on commodity clusters, HDFS
simplifies the acquisition and storage of diverse data sources, whether structured,
semi-structured (e.g., web logs and sensor feeds), or unstructured (e.g., social media,
image, video, and audio). Once the data is in the Hadoop file system, MapReduce and
commercial Hadoop-based tools are available to prepare it for loading into an existing
data warehouse. The ability to “define schema on query” versus “define schema on load”
simplifies amassing data from a variety of sources, even if you are not sure when and
how you might use that data later (see Figure 3).

The result is a single platform for feeding both your data warehouse and analytics
environment. This inexpensive, scale-out solution can be used to store ALL of your
data.
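
The idea of defining the schema on query rather than on load can be sketched in plain Python. This is an illustration only: the sample events, the two schema dictionaries, and the `read_with_schema` helper are hypothetical, and a list of JSON strings stands in for files stored in HDFS.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced at load time.
RAW_EVENTS = [
    '{"user": "a1", "page": "/home", "ms": 120}',
    '{"user": "b2", "page": "/cart", "ms": 340, "coupon": "SPRING"}',
    '{"sensor": "t-17", "temp_c": 21.5}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at query time: keep records with every field, cast the values."""
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two different "queries" impose two different schemas on the same raw store.
PAGE_VIEW_SCHEMA = {"user": str, "page": str, "ms": int}
SENSOR_SCHEMA = {"sensor": str, "temp_c": float}

page_views = list(read_with_schema(RAW_EVENTS, PAGE_VIEW_SCHEMA))
sensor_readings = list(read_with_schema(RAW_EVENTS, SENSOR_SCHEMA))
```

Nothing had to be decided about the sensor feed when it was stored. The schema is supplied only when someone finally asks a question of it.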




Figure 3: Use Hadoop as an Operational Data Store and Analyze ALL of the Data

TACTIC #4: LEVERAGE UNSTRUCTURED DATA TO ADD NEW METRICS TO AN ENTERPRISE DATA WAREHOUSE

BENEFITS
Leverage new metrics, dimensions, and dimensional attributes gleaned from unstructured
data sources
 •   Leverage customers’ interests, passions, associations, and affiliations to improve
     micro-segmentation
 •   Add sensor-generated performance data into your manufacturing, supply chain, or
     product predictive maintenance models

An easy way to start building experience with Hadoop and MapReduce is to use these
technologies to create new metrics from an unstructured data source that can be fed
into the enterprise data warehouse. This will provide the ability to leverage data such
as social, mobile, consumer comments, email, doctors’ notes, or claims descriptions
to identify new metrics that are better predictors of performance. Most organizations’
existing data warehouses are treasure troves of key performance indicators and
metrics used to monitor business performance. Use Hadoop and MapReduce to parse
through unstructured data to identify new business performance metrics that can be
integrated into the existing data warehouse (see Figure 4).
Figure 4: Parse Unstructured Data Using Hadoop/MapReduce and Incorporate Results into the Enterprise Data Warehouse
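
The map, shuffle, and reduce steps involved in turning free-text comments into a metric can be sketched in plain Python. This is a toy illustration, not Hadoop code: the keyword lists, the sample comments, and the function names are hypothetical, and a real project would run the same logic as a MapReduce job with a proper sentiment model.

```python
from collections import defaultdict

# Hypothetical keyword lists; a real project would use a proper sentiment model.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"broken", "slow", "refund"}

COMMENTS = [
    ("p1", "Love it, great battery and fast shipping"),
    ("p1", "Arrived broken, want a refund"),
    ("p2", "Slow and broken"),
]


def map_comment(product_id, text):
    """Map step: emit (product_id, +1 or -1) for each sentiment keyword found."""
    for word in text.lower().replace(",", " ").split():
        if word in POSITIVE:
            yield product_id, 1
        elif word in NEGATIVE:
            yield product_id, -1


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_scores(product_id, values):
    """Reduce step: collapse each product's values into one net sentiment score."""
    return product_id, sum(values)


mapped = (pair for pid, text in COMMENTS for pair in map_comment(pid, text))
sentiment_by_product = dict(reduce_scores(k, v) for k, v in shuffle(mapped).items())
```

The output is a small, structured table of one score per product, which is the kind of result that can be loaded into a warehouse as a new metric.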
  

Once these new metrics are in the enterprise data warehouse, they can be used to
enhance existing business intelligence queries, reports, dashboards, and analyses
(see Figure 5).
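
As a rough sketch of what enhancing an existing query with a new metric can look like, the example below joins a hypothetical `product_sentiment` table to a hypothetical `product_sales` table. Python's built-in `sqlite3` module stands in for the enterprise data warehouse, and all table names, column names, and values are invented for illustration.

```python
import sqlite3

dw = sqlite3.connect(":memory:")  # in-memory stand-in for the enterprise data warehouse
dw.executescript("""
    CREATE TABLE product_sales (product_id TEXT PRIMARY KEY, revenue REAL);
    CREATE TABLE product_sentiment (product_id TEXT PRIMARY KEY, net_sentiment INTEGER);
    INSERT INTO product_sales VALUES ('p1', 1200.0), ('p2', 800.0), ('p3', 450.0);
    INSERT INTO product_sentiment VALUES ('p1', 1), ('p2', -2);
""")


def revenue_with_sentiment(conn):
    """Existing revenue report, enriched with the new metric via a LEFT JOIN."""
    query = """
        SELECT s.product_id, s.revenue, m.net_sentiment
        FROM product_sales AS s
        LEFT JOIN product_sentiment AS m ON m.product_id = s.product_id
        ORDER BY s.product_id
    """
    return conn.execute(query).fetchall()


report = revenue_with_sentiment(dw)
```

The `LEFT JOIN` keeps every product in the original report even when no sentiment score exists yet, so the existing report is extended rather than filtered.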




Figure 5: Integrate Social Media Metrics into the Existing BI Environment

Note: implementing this tactic places companies in a good position as Hadoop
continues its assimilation into the relational database market. Being able to create
metrics and process data on Hadoop, leveraging tools like HBase and Hive that are
evolving quickly, and having BI tools connect directly to HDFS, may make data
warehouse professionals question why they need to move data to a relational
database at all.

MODERNIZE YOUR DATA WAREHOUSE TODAY
In the world of revolutionary, game-changing Big Data developments, data
warehouse modernization may sound like an evolutionary development. However, it is
something that can be executed today, with existing data warehouse skills, and
represents a simple first step toward gleaning immediate business value and
organizational agility from Big Data technologies. Why are you waiting?
EMC CONSULTING
                                    As part of EMC® Corporation, the world’s leading developer and provider of
                                    information infrastructure technology and solutions, EMC Consulting provides
                                    strategic guidance and technology expertise to help organizations exploit information
                                    to its maximum potential. With worldwide expertise across organizations’ businesses,
                                    applications, and infrastructures, as well as deep industry understanding, EMC
                                    Consulting guides and delivers revolutionary thinking to help clients realize their
                                    ambitions in an information economy. EMC Consulting drives execution for its clients,
                                    including more than half of the Global Fortune 500 companies, to transform
                                    information into actionable strategies and tangible business results.




CONTACT US
For more information, visit
www.EMC.com/consulting or
contact your local EMC Consulting
representative.




                                    EMC², EMC, and the EMC logo are registered trademarks or trademarks of EMC Corporation in the
                                    United States and other countries. © Copyright 2012 EMC Corporation. All rights reserved.
                                    Published in the USA. 08/12 EMC Perspective H10915

                                    EMC believes the information in this document is accurate as of its publication date. The
                                    information is subject to change without notice.

www.EMC.com
