Big Data
    Analytics:
     Profiling the Use of Analytical
     Platforms in User Organizations
BY WAYNE ECKERSON
 Director of Research, Business Applications and Architecture Group, TechTarget, September 2011




                   BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS   1




                    Executive Summary

                    THIS REPORT EXAMINES the rise of “big data” and the use of analytics to mine
                    that data. Companies have been storing and analyzing large volumes of data
                    since the advent of the data warehousing movement in the early 1990s. While
                    terabytes used to be synonymous with big data warehouses, now it’s petabytes,
                    and the rate of growth in data volumes continues to escalate as organizations
                    seek to store and analyze greater levels of transaction details, as well as
                    Web- and machine-generated data, to gain a better understanding of customer
                    behavior and drivers.

                    ■ Analytical platforms. To keep pace with the desire to store and analyze ever
                    larger volumes of structured data, relational database vendors have delivered
                    specialized analytical platforms that provide dramatically higher levels of
                    price-performance compared with general-purpose relational database
                    management systems (RDBMSs). These analytical platforms come in many shapes
                    and sizes, from software-only databases and analytical appliances to
                    analytical services that run in a third-party hosted environment. Almost
                    three-quarters (72%) of our survey respondents said they have implemented an
                    analytical platform that fits this description.
                       In addition, new technologies have emerged to address exploding volumes of
                    complex structured data, including Web traffic, social media content and
                    machine-generated data, such as sensor and Global Positioning System (GPS)
                    data. New nonrelational database vendors combine text indexing and natural
                    language processing techniques with traditional database technology to
                    optimize ad hoc queries against semi-structured data. And many Internet and
                    media companies use new open source frameworks such as Hadoop and MapReduce
                    to store and process large volumes of structured and unstructured data in
                    batch jobs that run on clusters of commodity servers.

                    ■ Business users. In the midst of these platform innovations, business users
                    await tools geared to their information requirements. Casual users
                    (executives, managers, front-line workers) primarily use reports and
                    dashboards that deliver answers to predefined questions. Power users
                    (business analysts, analytical modelers and data scientists) perform ad hoc
                    queries against a variety of sources. Most business intelligence (BI)
                    environments have done a poor job meeting these diverse needs within a
                    single, unified architecture. But this is changing.

                    ■ Unified architecture. This report portrays a unified reporting and analysis
                    environment that finally turns power users into first-class corporate
                    citizens and makes unstructured data a legitimate target for ad hoc and batch
                    queries. The new architecture leverages new analytical technology to stage,
                    store and process large volumes of structured and unstructured data,
                    turbo-charge sluggish data warehouses and offload complex analytical queries
                    to dedicated data marts. Besides supporting standard reports and dashboards,
                    it creates a series of analytical sandboxes that enable power users to mix
                    personal and corporate data and run complex analytical queries that fuel the
                    modern-day corporation.




                    Research Background

                    THE PURPOSE OF this report is to profile the use of analytical platforms in
                    user organizations. It is based on a survey of 302 BI professionals as well
                    as interviews with BI practitioners at user organizations and BI experts at
                    consultancies and software companies.

                    ■ Survey. The survey consists of 25 pages of questions (approximately 50
                    questions) with four branches, one for each analytical platform deployment
                    option: analytical database (software-only), analytical appliance
                    (hardware-software combo), analytical service and file-based analytical
                    system (e.g., Hadoop and NoSQL). Respondents who didn’t select an option were
                    passed to a fifth branch where they were asked why they hadn’t purchased an
                    analytical platform and whether they planned to do so.
                       The survey ran from June 22 to August 2, 2011, and was publicized through
                    several channels. The BI Leadership Forum and BeyeNetwork sent several email
                    broadcasts to their lists. I tweeted about the survey and asked followers to
                    retweet the announcement. Several sponsors, including Teradata, Infobright,
                    and ParAccel, notified their customers about the survey through email
                    broadcasts and newsletters.

                    ■ Respondent profile. Survey respondents are generally IT managers based in
                    North America who work at large companies in a variety of industries (see
                    Figures 1-4, page 6).



                    Figure 1: Which best describes your position in BI?
                    [Bar chart. Categories: VP/Director, Architect, Manager, Consultant, Analyst,
                    Administrator, Developer, Other. Axis scale: 0 to 30.]

                    Figure 2: Where are you located?
                    North America              66.7%
                    Europe                     16.5%
                    Other                      16.9%

                    Figure 3: What size is your organization by revenues?
                    Large ($1B+ revenues)      52.4%
                    Medium ($50M to $1B)       24.8%
                    Small (<$50M)              22.8%

                    Figure 4: In what industry do you work?
                    [Bar chart. Categories: Retail, Consulting, Banking, Insurance, Computers,
                    Telecommunications, Software, Manufacturing, Health Care Payor,
                    Hospitality/Travel, Other. Axis scale: 0 to 35.]




                    Why Big Data?

                    THERE HAS BEEN a lot of talk about “big data” in the past year, which I find
                    a bit puzzling. I’ve been in the data warehousing field for more than 15
                    years, and data warehousing has always been about big data.
                       Back in the late 1990s, I attended a ceremony honoring the Terabyte Club,
                    a handful of companies that were storing more than a terabyte of raw data in
                    their data warehouses. Fast-forward more than 10 years and I could now be
                    attending a ceremony for the Petabyte Club. The trajectory of data
                    acquisition and storage for reporting and analytical applications has been
                    steadily expanding for the past 15 years.
                       So what’s new in 2011? Why are we talking about big data today? There are
                    several reasons:

                    1. Changing data types. Organizations are capturing different types of data
                    today. Until about five years ago, most data was transactional in nature,
                    consisting of numeric data that fit easily into rows and columns of
                    relational databases. Today, the growth in data is fueled by largely
                    unstructured data from websites (e.g., Web traffic data and social media
                    content) as well as machine-generated data from an exploding number of
                    sensors. Most of the new data is actually semi-structured in format, because
                    it consists of headers followed by text strings. Pure unstructured data, such
                    as audio and video data, has limited textual content and is more difficult to
                    parse and analyze, but it is also growing (see Figure 5, page 8).


                    2. Technology advances. Hardware has finally caught up with software. The
                    exponential gains in price-performance exhibited by computer processors,
                    memory and disk storage have finally made it possible to store and analyze
                    large volumes of data at an affordable price. Database vendors have exploited
                    these advances by developing new high-speed analytical platforms designed to
                    accelerate query performance against large volumes of data, while the open
                    source community has developed Hadoop, a distributed file management system
                    designed to capture, store and analyze large volumes of Web log data, among
                    other things. In other words, organizations are storing and analyzing more
                    data because they can.

                    3. Insourcing and outsourcing. Because of the complexity and cost of storing
                    and analyzing Web traffic data, most organizations traditionally outsourced
                    these functions to third-party service bureaus like Omniture. But as the size
                    and importance of corporate e-commerce channels have increased, many are now
                    eager to insource this data to gain greater insights about customers. For
                    example, automobile valuation company Kelley Blue Book is now collecting and
                    storing Web traffic data in-house so it can combine that information with
                    sales and other corporate data to better understand customer behavior,
                    according to Dan Ingle, vice president of analytical insights and technology
                    at the company. At the same time, virtualization technology is beginning to
                    make it attractive for organizations to consider moving large-scale data
                    processing outside their data center walls to private hosted networks or
                    public clouds.

                    Figure 5: Data growth
                    [Chart of data growth from 2005 to 2012 in two series: unstructured and
                    content depot; structured and replicated.]
                    SOURCE: IDC DIGITAL UNIVERSE 2009: WHITE PAPER, SPONSORED BY EMC, 2009.

                    4. Developers discover data. The biggest reason for the popularity of the
                    term big data is that Web and application developers have discovered the
                    value of building new data-intensive applications. To application developers,
                    big data is new and exciting. Tim O’Reilly, founder of O’Reilly Media, a
                    longtime high-tech luminary and open source proponent, speaking at Hadoop
                    World in New York in November 2010, said: "We are the beginning of an amazing
                    world of data-driven applications. It's up to us to shape the world." Of
                    course, for those of us who have made our careers in the data world, the new
                    era of “big data” is simply another step in the evolution of data management
                    systems that support reporting and analysis applications.
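
                    The batch-style processing of Web log data described under "Technology
                    advances" can be illustrated with a minimal, single-process sketch of the
                    map, shuffle and reduce steps. This is not Hadoop code and uses no Hadoop
                    API; the log format, the sample lines and the function names (map_phase,
                    shuffle, reduce_phase, run_job) are assumptions made purely for
                    illustration. On a real cluster, a framework would distribute each phase
                    across commodity servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical, simplified Web log lines: "<client-ip> <url> <http-status>"
SAMPLE_LOG = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /cars/sedan 200",
    "10.0.0.1 /cars/sedan 200",
    "10.0.0.3 /home 404",
]


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (url, 1) for each well-formed log line."""
    parts = line.split()
    if len(parts) == 3:
        yield parts[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence within one process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Count requests per URL across the sample log.
hits = run_job(SAMPLE_LOG)
print(hits)
```

                    The same three-step shape applies to many aggregate questions asked of log
                    data; only the map and reduce functions change.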




                    Big Data Analytics:
                    Deriving Value from Big Data

                    BIG DATA BY itself, regardless of the type, is worthless unless business
                    users do something with it that delivers value to their organizations. That’s
                    where analytics comes in. Although organizations have always run reports
                    against data warehouses, most haven’t opened these repositories to ad hoc
                    exploration. This is partly because analysis tools are too complex for the
                    average user but also because the repositories often don’t contain all the
                    data needed by the power user. But this is changing.

                    ■ Big vs. small data. A valuable characteristic of “big” data is that it
                    contains more patterns and interesting anomalies than “small” data. Thus,
                    organizations can gain greater value by mining large data volumes than small
                    ones. While users can detect the patterns in small data sets using simple
                    statistical methods, ad hoc query and analysis tools or by eyeballing the
                    data, they need sophisticated techniques to mine big data. Fortunately, these
                    techniques and tools already exist thanks to companies such as SAS Institute
                    and SPSS (now part of IBM) that ship analytical workbenches (i.e., data
                    mining tools). These tools incorporate all kinds of analytical algorithms
                    that have been developed and refined by academic and commercial researchers
                    over the past 40 years.

                    ■ Real-time data. Organizations that accumulate big data recognize quickly
                    that they need to change the way they capture, transform and move data from a
                    nightly batch process to a continuous process using micro batch loads or
                    event-driven updates. This technical constraint pays big business dividends
                    because it makes it possible to deliver critical information to users in near
                    real time. In other words, big data fosters operational analytics by
                    supporting just-in-time information delivery. The market today is witnessing
                    a perfect storm with the convergence of big data, deep analytics and
                    real-time information delivery.



                    ■ Complex analytics. In addition, during the past 15 years, the “analytical
                    IQ” of many organizations has evolved from reporting and dashboarding to
                    lightweight analysis conducted with query and online analytical processing
                    (OLAP) tools. Many organizations are now on the verge of upping their
                    analytical IQ by implementing complex analytics against both structured and
                    unstructured data. Complex analytics spans a vast array of techniques and
                    applications. Traditional analytical workbenches from SAS and SPSS create
                    mathematical models of historical data that can be used to predict future
                    behavior. This type of predictive analytics can be used to do everything from
                    delivering highly tailored cross-sell recommendations to predicting failure
                    rates of aircraft engines. In addition, organizations are now applying a
                    variety of complex analytics that are hard to perform with traditional
                    SQL-based tools, such as path analysis, graph analysis, link analysis and
                    fuzzy matching, to Web, social media and other forms of complex structured
                    data.
                       Organizations are now recruiting analysts who know how to wield these
                    analytical tools to unearth the hidden value in big data. They are hiring
                    analytical modelers who know how to use data mining workbenches, as well as
                    data scientists (application developers with process and data knowledge who
                    write programming code to run against large Hadoop clusters).

                    ■ Sustainable advantage. At the same time, executives have recognized the
                    power of analytics to deliver a competitive advantage, thanks to the
                    pioneering work of thought leaders such as Tom Davenport, who co-wrote the
                    book Competing on Analytics: The New Science of Winning. In fact,
                    forward-thinking executives recognize that analytics may be the only true
                    source of sustainable advantage since it empowers employees at all levels of
                    an organization with information to help them make smarter decisions. In
                    essence, analytics increases corporate intelligence, which is something you
                    can never package or systematize and competitors can’t duplicate. In short,
                    many organizations have laid the groundwork to reap the benefits of “big data
                    analytics.”
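
                    As a small illustration of the path analysis mentioned under "Complex
                    analytics", the following sketch counts page-to-page transitions within Web
                    sessions. The event layout (session ID, sequence number, page), the sample
                    data and the function name transition_counts are hypothetical; production
                    clickstream data would be far larger and would more plausibly be processed
                    on an analytical platform than in a single script.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical clickstream events: (session_id, sequence_number, page)
EVENTS = [
    ("s1", 1, "/home"), ("s1", 2, "/search"), ("s1", 3, "/listing"),
    ("s2", 1, "/home"), ("s2", 2, "/search"), ("s2", 3, "/checkout"),
    ("s3", 1, "/search"), ("s3", 2, "/listing"),
]


def transition_counts(
    events: Iterable[Tuple[str, int, str]]
) -> "Counter[Tuple[str, str]]":
    """Count how often visitors move directly from one page to another."""
    # Group page views by session.
    sessions: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for session_id, sequence, page in events:
        sessions[session_id].append((sequence, page))

    counts: "Counter[Tuple[str, str]]" = Counter()
    for visits in sessions.values():
        # Order each session's views, then pair every page with its successor.
        pages = [page for _, page in sorted(visits)]
        counts.update(zip(pages, pages[1:]))
    return counts


counts = transition_counts(EVENTS)
for (source, target), n in counts.most_common():
    print(f"{source} -> {target}: {n}")
```

                    The per-session ordering and pairing step is the kind of sequence-oriented
                    logic that the report describes as hard to do with traditional SQL-based
                    tools.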





                    A FRAMEWORK FOR SUCCESS
                    However, the road to big data analytics is not easy and success is not guaran-
                    teed. Analytical champions are still rare. That’s because succeeding with big
                    data analytics requires the right culture, people, organization, architecture and
                    technology (see Figure 6).

                    1. The right culture. Analytical organizations are championed by executives
                    who believe in making fact-based decisions or validating intuition with data.
                    These executives create a culture of performance measurement in which
                    individuals and groups are held accountable for the outcomes of predefined
                    metrics aligned with strategic objectives. These leaders recruit other
                    executives who believe in the power of data and are willing to invest money
                    and their own time to create a learning organization that runs by the numbers
                    and uses analytical techniques to exploit big data.

                    2. The right people. You can’t do big data analytics without power users, or
                    more specifically, business analysts, analytical modelers and data
                    scientists. These folks possess a rare combination of skills and knowledge:
                    They have a deep understanding of business processes and the data that sits
                    behind those processes and are skillful in the use of various analytical
                    tools, including Excel, SQL, analytical workbenches and coding languages.
                    They are highly motivated, critical thinkers who command an above-average
                    salary, exhibit a passion for success and deliver outsized value to the
                    organization.

                    Figure 6: Big data analytics framework


                    3. The right organization. Historically, analysts with the aforementioned
                    skills were hired by department heads and pooled in pockets of an
                    organization. But analytical champions create a shared service organization
                    (i.e., an analytical center of excellence) that makes analytics a pervasive
                    competence. Analysts are still assigned to specific departments and
                    processes, but they are also part of a central organization that provides
                    collaboration, camaraderie and a career path for analysts. At the same time,
                    the center's director maintains a close relationship with the data
                    warehousing team (if he or she doesn’t own the function outright) to ensure
                    that business analysts have open access to the data they need to do their
                    jobs. Data is fuel for a business analyst or data scientist.

                    4. The right architecture. The data warehousing team plays a critical role in
                    delivering deep analytics. It needs to establish an architecture that ensures
                    the delivery of high-quality, secure, consistent information while providing
                    open access to those who need it. Threading this needle takes wisdom, a good
                    deal of political astuteness and a BI-savvy data architecture team. The
                    architecture itself must be able to consume large volumes of structured and
                    unstructured data and make it available to different classes of users via a
                    variety of tools (see “Architecture for Big Data Analytics” below).

                    5. Analytical platform. At the heart of an analytical infrastructure is an
                    analytical platform, the underlying data management system that consumes,
                    integrates and provides user access to information for reporting and analysis
                    activities. Today, many vendors, including most of the sponsors of this
                    report, offer specialized analytical platforms that deliver dramatically
                    better query performance than existing systems. There are many types of
                    analytical platforms sold by dozens of vendors (see “Types of Analytical
                    Platforms” below).




                    Architecture for Big Data Analytics

                    IF BIG DATA is simply a continuation of longstanding data trends, does it
                    change the way organizations architect and deploy data warehousing
                    environments? Big data analytics doesn’t change data warehousing or BI
                    architectures; it simply supplements them with new technologies and access
                    methods better tailored to meeting the information requirements of business
                    analysts and data scientists.

                    ■ Top-down. For the past 15 years, BI teams have built data warehouses that
                    serve the information needs of casual users (e.g., executives, managers and
                    front-line staff). These top-down, report-driven environments require
                    developers to know in advance what kinds of questions casual users want to
                    ask and which metrics they want to monitor. With requirements in hand,
                    developers create a data warehouse model, build extract, transform and load
                    (ETL) routines to move data from source systems to the data warehouse, and
                    then create reports and dashboards to query the data warehouse (see Figure 7,
                    page 15).
                       Whether by choice or not, power users who operate in an exclusively
                    top-down BI environment are largely left to fend for themselves, using
                    spreadsheets, desktop databases, SQL and data-mining workbenches. Business
                    analysts generally find BI tools too inflexible and data warehousing data too
                    limited. At best, they might use BI tools as glorified extract engines to
                    dump data into Microsoft Excel, Access or some other analytical environment.
                    The upshot is that these analysts and data scientists generally spend an
                    inordinate amount of time preparing data instead of analyzing it and create
                    hundreds if not thousands of data silos that wreak havoc on information
                    consistency from a corporate perspective.

                    I Bottom-up. Business analysts and data scientists need a different type of
                    analytical environment, one that caters to their needs. This is a bottom-up
                    environment that fosters ad hoc exploration of any data source, both inside and
                    outside corporate boundaries, and minimizes the need for analysts to create
                    data silos. Here, business analysts don’t know in advance what questions they
                    need to answer because they are usually responding to emergency requests
                    from executives and managers who need information to address new
                    and unanticipated events in the marketplace. Rather than focus on goals and
                    metrics, business analysts spend most of their time engaged in ad hoc projects,
                    or they work closely with business managers to optimize existing processes.
                       As you can see, there is a world of difference between a top-down and a
                    bottom-up BI environment. Many organizations have tried to support both
                    types of processing within a single BI environment. But that no longer works
                    in the age of big data analytics. Forward-thinking companies are expanding
                    their data warehousing architectures and data governance programs to better
                    balance the dynamic between top-down and bottom-up requirements. (See
                    Analytic Architectures: Approaches to Supporting Analytics Users and Workloads,
                    a 40-page report by Wayne Eckerson, available for free download.)

                                           Figure 7: Top-down vs. bottom-up BI
                           Top-down and bottom-up BI environments are distinct but complementary,
                       yet most organizations try to shoehorn both into a single architecture.



                    NEXT-GENERATION BI ARCHITECTURE
                    Figure 8 represents the next-generation BI architecture, which blends elements
                    from top-down and bottom-up BI into a single cohesive environment
                    that adequately supports both casual and power users. The top half of the diagram
                    represents the classic top-down, data warehousing architecture that primarily
                    delivers interactive reports and dashboards to casual users (although
                    the streaming/complex event processing (CEP) engine is new). The bottom
                    half of the diagram adds new architectural elements and data sources that
                    better accommodate the needs of business analysts and data scientists and
                    make them full-fledged members of the corporate data environment.




                                             Figure 8: The new BI architecture
                           The next-generation BI architecture is more analytical, giving power users
                        greater options to access and mix corporate data with their own data via various
                          types of analytical sandboxes. It also brings unstructured and semi-structured
                                data fully into the mix using Hadoop and nonrelational databases.



                    SERVER ENVIRONMENT
                    I Hadoop. The biggest change in the new BI architecture is that the data warehouse
                    is no longer the centerpiece. It now shares the spotlight with systems
                    that manage structured and unstructured data. The most popular among
                    these is Hadoop, an open source software framework for building data-intensive
                    applications. Following the example of Internet pioneers, such as Google,
                    Amazon and Yahoo, many companies now use Hadoop to store, manage and
                    process large volumes of Web data.
                       Hadoop runs on the Hadoop Distributed File System (HDFS), a distributed file
                    system that scales out on commodity servers. Since Hadoop is file-based,
                    developers don’t need to create a data model to store or process data, which
                    makes Hadoop ideal for managing semi-structured Web data, which comes in many
                    shapes and sizes. But because it is “schema-less,” Hadoop can be used to store
                    and process any kind of data, including structured transactional data and
                    unstructured audio and video data. The biggest advantage of Hadoop right now,
                    however, is that it’s open source, which means that the up-front costs of
                    implementing a system to process large volumes of data are lower than for
                    commercial systems. The trade-off is that Hadoop requires companies to purchase
                    and manage dozens, if not hundreds, of servers and train developers and
                    administrators to use this new technology.

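
The “schema-less” idea can be made concrete with a small sketch. What follows is plain Python, not Hadoop code, and the record layout and function names are invented for illustration. It only shows the principle that records of different shapes are stored as-is and structure is imposed when a specific question is asked of them.

```python
import json

# Hypothetical raw records as they might sit in a file: no shared schema.
RAW_LINES = [
    '{"type": "pageview", "user": "u1", "url": "/home"}',
    '{"type": "order", "user": "u1", "amount": 42.5}',
    '{"type": "pageview", "user": "u2", "url": "/pricing", "referrer": "ad"}',
]


def read_records(lines):
    """Parse each line at read time; nothing was declared when it was stored."""
    for line in lines:
        yield json.loads(line)


def pageviews_by_user(records):
    """Impose structure only for this one question: page views per user."""
    counts = {}
    for rec in records:
        if rec.get("type") == "pageview":
            counts[rec["user"]] = counts.get(rec["user"], 0) + 1
    return counts


views = pageviews_by_user(read_records(RAW_LINES))
print(views)
```

The order record is simply skipped by this particular question; a different question could read the same lines and pick out a different set of fields.
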
                    I Data warehouse integration. Today some companies use Hadoop as a
                    staging area for unstructured and semi-structured data (e.g., Web traffic)
                    before loading it into a data warehouse. These companies keep the atomic
                    data in Hadoop and push lightly summarized data sets to the data warehouse
                    or nonrelational systems for reporting and analysis. However, some companies
                    let power users with appropriate skills query raw data in Hadoop.
                       For example, LiveRail, an online video advertising service provider, follows
                    the lead of most Internet providers and uses both Hadoop and a data warehouse
                    to support a range of analytical needs. LiveRail keeps its raw Web data
                    from video campaigns in Hadoop and summarized data about those campaigns
                    in Infobright, a commercial, open source columnar database. Business
                    users, including LiveRail’s customers who want to check on the performance
                    of their campaigns, run ad hoc queries and reports against Infobright, while
                    developers who need to access the raw data use Hive to schedule and run
                    reports against Hadoop.
                       “Since Hadoop isn’t interactive, we needed a fast database that scales to
                    support our ad hoc environment, which is why we chose Infobright,” said
                    Andrei Dunca, chief technology officer at LiveRail.
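
The staging pattern described above, with atomic data kept in one place and lightly summarized data pushed to a query database, can be sketched in a few lines. This is an illustration under stated assumptions, not LiveRail's implementation: an in-memory SQLite database stands in for the data warehouse, a Python list stands in for the raw data held in Hadoop, and all table, column and campaign names are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical raw ad events kept at atomic grain (the "Hadoop" side).
RAW_EVENTS = [
    ("cmp-1", "impression"), ("cmp-1", "impression"), ("cmp-1", "click"),
    ("cmp-2", "impression"), ("cmp-2", "click"), ("cmp-2", "click"),
]


def summarize(events):
    """Roll atomic events up to one row per (campaign, event type)."""
    counts = Counter(events)
    return [(campaign, kind, n) for (campaign, kind), n in sorted(counts.items())]


# An in-memory SQLite database stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE campaign_summary (campaign TEXT, event TEXT, total INTEGER)"
)
warehouse.executemany(
    "INSERT INTO campaign_summary VALUES (?, ?, ?)", summarize(RAW_EVENTS)
)

# Business users run ad hoc SQL against the summary, not the raw events.
clicks = dict(warehouse.execute(
    "SELECT campaign, total FROM campaign_summary WHERE event = 'click'"
))
print(clicks)
```

The raw events stay available for developers who need full detail, while the summary table stays small enough to answer interactive queries quickly.
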

                    I Use cases for Hadoop. According to a new report by Ventana Research
                    titled Hadoop and Information Management: Benchmarking the Challenge of
                    Enormous Volumes of Data, Hadoop is more likely to be used than traditional
                    data management systems for three purposes: to “perform types of analytics
                    that couldn’t be done on large volumes of data before,” “capture all the source
                    data we are collecting (pre-process)” and “keep more historical data
                    (post-process).” (See Figure 9.)
                       Once data lands in Hadoop, whether it’s Web data or not, organizations
                    have several options:




                                                   Figure 9: Role of Hadoop
                       [Bar chart comparing Hadoop and non-Hadoop respondents, on a scale of 0 to 100, for:
                       analyze data at a greater level of detail; perform types of analytics that couldn’t
                       be done on large volumes of data before; keep more historical data (post-process);
                       capture all of the source data that we are collecting (pre-process).]

                       SOURCE: HADOOP AND INFORMATION MANAGEMENT: BENCHMARKING THE CHALLENGE OF ENORMOUS
                       VOLUMES OF DATA: EXECUTIVE SUMMARY, VENTANA RESEARCH, JUNE 23, 2011.



                        R Create an online archive. With Hadoop, organizations don’t have to delete
                          or ship the data to offline storage; they can keep it online indefinitely by
                          adding commodity servers to meet storage and processing requirements.
                          Hadoop becomes a low-cost alternative for meeting online archival
                          requirements.

                        R Feed the data warehouse. Organizations can also use Hadoop to parse,
                          integrate and aggregate large volumes of Web or other types of data and
                          then ship it to the data warehouse, where both casual and power users can
                          query and analyze the data using familiar BI tools. Here, Hadoop becomes
                          an ETL tool for processing large volumes of Web data before it lands in the
                          corporate data warehouse.

                        R Support analytics. The big data crowd (i.e., Internet developers) views
                          Hadoop primarily as an analytical engine for running analytical computations
                          against large volumes of data. To query Hadoop, analysts currently
                          need to write programs in Java or other languages and understand
                          MapReduce, a framework for writing distributed (or parallel) applications.
                          The advantage here is that analysts aren’t restricted by SQL when formulating
                          queries. SQL does not support many types of analytics, especially
                          those that involve inter-row calculations, which are common in Web traffic
                          analysis. The disadvantage is that Hadoop is batch-oriented and not conducive
                          to iterative querying.

                        R Run reports. Hadoop’s batch-orientation, however, makes it suitable for
                          executing regularly scheduled reports. Rather than running reports against
                          summary data, organizations can now run them against raw data, guaranteeing
                          the most accurate results.

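
To show what an inter-row calculation in the MapReduce style looks like, here is a minimal sketch that counts Web sessions per user by comparing each click with the one before it. It imitates the map, shuffle and reduce steps in plain Python on one machine; it does not use the Hadoop API. The click data, the 30-minute session rule and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical click log: (user, seconds since midnight, page).
CLICKS = [
    ("u1", 100, "/home"), ("u2", 105, "/home"), ("u1", 160, "/pricing"),
    ("u1", 4000, "/home"), ("u2", 130, "/signup"),
]
SESSION_GAP = 1800  # assumed rule: 30 idle minutes starts a new session


def map_phase(records):
    """Emit (key, value) pairs keyed by user."""
    for user, ts, page in records:
        yield user, (ts, page)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(user, visits):
    """Compare each click with the previous one (the inter-row step)."""
    sessions, previous_ts = 0, None
    for ts, _page in sorted(visits):
        if previous_ts is None or ts - previous_ts > SESSION_GAP:
            sessions += 1
        previous_ts = ts
    return user, sessions


grouped = shuffle(map_phase(CLICKS))
sessions_per_user = dict(
    reduce_phase(user, visits) for user, visits in grouped.items()
)
print(sessions_per_user)
```

Because each user's clicks arrive at a single reduce call, the comparison between consecutive rows is an ordinary loop. On a real cluster the same two functions would run in parallel across many servers, in batch.
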
                    I Nonrelational databases. While Hadoop has received a lot of press attention
                    lately, it’s not the only game in town for storing and managing semi-structured
                    data. In fact, an emerging and diverse set of products goes one step further
                    than Hadoop and stores both structured and unstructured data within a
                    single index. These so-called nonrelational databases (depicted in Figure 8
                    supporting a “free-standing sandbox”) typically extract entities from documents,
                    files and other databases using natural language processing techniques
                    and index them as key-value pairs for quick retrieval using a document-centric
                    query language such as XQuery. As a result, these products can give
                    users one place to go to query both structured and unstructured data.
                       This style of analysis, which some call “unified information access,” exhibits
                    many search-like characteristics. But instead of returning a list of links, the
                    systems return qualified data sets or reports in response to user queries. And
                    unlike Hadoop, the systems are interactive, allowing users to submit queries in
                    an iterative fashion so they can understand trends and issues.
                       These nonrelational systems complement Hadoop, an enterprise data warehouse
                    or both. For example, organizations might use Hadoop to transcribe
                    audio files and then load the transcriptions into a nonrelational database for
                    analysis. Or they might replicate sales and customer data from a data warehouse
                    and combine it with Web data in a nonrelational database so power users
                    can find correlations between Web traffic and customer orders without bogging
                    down performance of the data warehouse with complex queries. This type of
                    unified information access is critical in a growing number of applications.
                       For example, an oil and gas company uses MarkLogic to track the location of
                    ships at sea. The MarkLogic Server stores GPS data, news feeds, weather
                    data and commodity prices, among other things, and surfaces all this data on a
                    map that users can query. For example, a user might ask, “Show me all the
                    ships within this polygon (i.e., geographic area) that are carrying this type of
                    oil and have changed course since leaving the port of origin.” The application
                    then displays the results on the map.

                    I Data warehouse hubs. While Hadoop and nonrelational systems primarily
                    manage semi-structured and unstructured data, the data warehouse manages
                    structured data from run-the-business operational systems. Except for Teradata
                    shops, many companies increasingly use data warehouses running on traditional
                    relational databases as hubs to feed other systems and applications
                    rather than to host reporting and analysis applications.
                       For example, Dow Chemical, which maintains a large SAP Business Warehouse
                    (BW) data warehouse, now runs all queries against virtual cubes that
                    run in memory using SAP BW Accelerator. “By running our cubes in memory,
                    we’ve de-bottlenecked our data warehouse,” said Mike Masciandaro, director
                    of business intelligence at Dow. “Now our data warehouse primarily manages
                    batch loads to stage data.” Likewise, Blue Cross Blue Shield of Kansas City has
                    transformed its IBM DB2 data warehouse into a hub that feeds transaction and
                    analytical systems and implemented a Teradata Data Warehouse Appliance 2650
                    to handle all reports and queries and support a self-service BI environment.

                    I Analytical sandboxes. In keeping with its role as a hub, an enterprise data
                    warehouse at many organizations now distributes data to analytical sandboxes
                    that are designed to wean business analysts and data scientists off data shadow
                    systems and make them full-fledged consumers of the corporate data
                    infrastructure. There are four types of analytical sandboxes:

                        R Hadoop. Hadoop can be considered an analytical sandbox for Web data
                          that developers with appropriate skills can access to run complex queries
                          and calculations. Rather than analyze summarized or transformed data in
                          a data warehouse, developers can run calculations and models against the
                          raw, atomic data.

                        R Virtual DW sandbox. A virtual data warehouse sandbox is a partition, or
                          set of tables, inside the data warehouse, dedicated to individual analysts.
                          Rather than create a spreadmart, analysts upload their data into a partition
                          and combine it with data from the data warehouse that is either
                          “pushed” to the partition by the BI team using ETL processes or “pulled”
                          by analysts through queries. The BI team carefully allocates compute
                          resources so analysts have enough horsepower to run ad hoc and complex
                          queries without interfering with other workloads running on the data
                          warehouse.





                        R Free-standing sandbox. A free-standing sandbox is a separate system
                          from the data warehouse with its own server, storage and database that is
                          designed to support complex, analytical queries. In some cases, it is an
                          analytical platform that runs complex queries against a replica of the data
                          warehouse. In other cases, it runs on a nonrelational database and contains
                          an entirely new set of data (e.g., Web logs or sensor data) that either
                          doesn’t fit in the data warehouse because of space constraints or is
                          processed more efficiently in a nontraditional platform. Occasionally, it
                          mixes both corporate and departmental data into a local data mart that
                          can run on-premises or off-site in a hosted environment. In all cases, it
                          provides a dedicated environment for a targeted group of analytically
                          minded users.

                        R In-memory BI sandbox. Some desktop BI tools, such as QlikView or PowerPivot,
                          maintain a local data store in memory to support interactive dashboards
                          or ad hoc queries. These sandboxes are popular among analysts
                          because they generally let them pull data from any source, quickly link
                          data sets, run super-fast queries against data held in memory, and visually
                          interact with the results, all with little or no IT intervention. Also,
                          some server-based environments, such as SAP HANA, store all data in
                          memory, accelerating queries for all types of BI users.

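
The virtual DW sandbox can be illustrated with a short sketch. It assumes an in-memory SQLite database as a stand-in for the data warehouse, and the table names, column names and values are hypothetical. The point is only the pattern: the analyst's own data is uploaded into a dedicated table and joined to corporate data where that data already lives.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse in this sketch.
dw = sqlite3.connect(":memory:")

# Corporate table maintained by the BI team (hypothetical data).
dw.execute("CREATE TABLE dw_orders (customer_id INTEGER, amount REAL)")
dw.executemany(
    "INSERT INTO dw_orders VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 40.0), (3, 300.0)],
)

# The analyst uploads a personal list into a sandbox table instead of
# exporting warehouse data to a spreadsheet.
dw.execute("CREATE TABLE sandbox_segments (customer_id INTEGER, segment TEXT)")
dw.executemany(
    "INSERT INTO sandbox_segments VALUES (?, ?)",
    [(1, "loyal"), (2, "new"), (3, "loyal")],
)

# Corporate and analyst data are combined inside the warehouse.
revenue_by_segment = dict(dw.execute(
    """
    SELECT s.segment, SUM(o.amount)
    FROM sandbox_segments AS s
    JOIN dw_orders AS o ON o.customer_id = s.customer_id
    GROUP BY s.segment
    """
))
print(revenue_by_segment)
```

A production warehouse would add what this sketch leaves out: per-analyst permissions on the sandbox tables and workload management so that sandbox queries do not starve other users.
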
                    I Streaming/CEP engine. The top-down environment picks up one important
                    new architectural feature: streaming and CEP engines. Built to support
                    continuous intelligence, CEP engines ingest large volumes of
                    discrete events in real time, calculate or correlate those events, enrich them
                    with historical data if needed, and apply rules that notify users when specific
                    types of activity or anomalies occur. For example, these engines are ideal for
                    detecting fraud in a stream of thousands of transactions per second.
                       These rules-driven systems are like intelligent sensors that organizations
                    can attach to streams of transaction data to watch for meaningful combinations
                    of events or trends. In essence, CEP systems are sophisticated notification
                    systems designed to monitor real-time events. They are ideal for monitoring
                    continuous operations, such as supply chains, transportation operations,
                    factory floors, casinos, hospital emergency rooms, Web-based gaming systems
                    and customer contact centers.
                       Streaming engines are similar to CEP engines but are designed to handle
                    enormous volumes of a single discrete event type, such as sensor data generated
                    by a pipeline or medical device. Streaming engines typically ingest an
                    order of magnitude more events per second than CEP engines but typically
                    only pull data from a single source. However, streaming and CEP engines are
                    merging in functionality as vendors seek to offer one-stop shopping for continuous
                    intelligence capabilities.
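
The kind of rule a CEP engine applies can be sketched as a sliding-window check over an event stream. The example below is a simplified illustration in plain Python, not the API of any CEP product. The 60-second window, the threshold of three transactions and the class name VelocityRule are assumptions chosen for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # assumed rule window
MAX_TRANSACTIONS = 3  # assumed threshold before an alert fires


class VelocityRule:
    """Alert when one card exceeds MAX_TRANSACTIONS inside the window."""

    def __init__(self):
        self._recent = defaultdict(deque)  # card -> timestamps in window

    def on_event(self, card, timestamp):
        window = self._recent[card]
        window.append(timestamp)
        # Drop events that have aged out of the sliding window.
        while window and timestamp - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_TRANSACTIONS:
            return f"ALERT card={card} count={len(window)}"
        return None


# Hypothetical event stream: (card, timestamp in seconds).
STREAM = [("A", 0), ("B", 5), ("A", 10), ("A", 20), ("A", 30), ("A", 200)]

rule = VelocityRule()
alerts = []
for card, ts in STREAM:
    message = rule.on_event(card, ts)
    if message:
        alerts.append(message)
print(alerts)
```

Each event is evaluated as it arrives and only a small window of state is kept per card, which is what lets this style of processing keep up with a continuous stream rather than waiting for a batch load.
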


                    CLIENT ENVIRONMENT
                    I Casual users. The front-end of the new BI architecture remains relatively
                    unchanged for casual users, who continue to use reports and dashboards running
                    against dependent data marts (either logical or physical) fed by a data
                    warehouse. This environment typically meets 60% to 80% of their information
                    needs, which can be defined up-front through requirements-gathering
                    exercises. Predefined reports and dashboards are designed to answer questions
                    tailored to individual roles within the organization.
                       However, meeting the ad hoc needs of casual users continues to be a problem.
                    Interactive reports and dashboards help to some degree, but casual
                    users today still rely on the IT department or “super users” (tech-savvy
                    business colleagues) to create ad hoc reports and views on their behalf.
                    Search-based exploration tools that allow users to type queries in plain English
                    and refine their search using facets or categories offer significant promise
                    but are not yet mainstream technology.
                       One new addition to the casual user environment is dashboards powered
                    by streaming/CEP engines. While these operational dashboards are primarily
                    used by operational analysts and workers, many executives and managers are
                    keen to keep their fingers on the pulse of their companies’ core processes by
                    accessing these “twinkling” dashboards directly or, more commonly, receiving
                    alerts from these systems inside existing BI environments.





                    I Power users. The biggest change in the new analytical BI architecture is how
                    it accommodates the information needs of power users. It gives power users
                    many new options for consuming corporate data rather than creating countless
                    spreadmarts. A power user is a person whose job is to crunch data on a
                    daily basis to generate insights and plans. Power users include business analysts
                    (e.g., Excel jockeys), analytical modelers (e.g., SAS programmers and
                    statisticians) and data scientists (e.g., application developers with business
                    process and database expertise). Power users have five options, as depicted in
                    Figure 8:

                        R Query a virtual sandbox. Rather than spend valuable time creating
                          spreadmarts, power users can leverage the processing power and data of
                          the data warehouse by using a virtual sandbox. Here, they can upload
                          their own data to the data warehouse, mix it with corporate data and perform
                          their analyses. However, if they want to share something they’ve built
                          in their sandboxes, they need to hand over their analyses to the BI team to
                          turn them into production applications.

                        R Query a free-standing sandbox. Power users can also query a free-standing
                          sandbox created for their benefit. In most cases, the system is tuned to
                          support ad hoc queries and analytical modeling activities against a replica
                          of the data warehouse or another data set designed for power users.

                        R Query a BI sandbox. Power users can download data from a data warehouse
                          or other source into a local BI tool and interact with the data in
                          memory at the speed of thought. While these sandboxes have the potential
                          to become spreadmarts, new analytical tools usually bake in server
                          environments that encourage, if not require, power users to publish their
                          analyses to an IT-controlled environment. Many are also starting to offer
                          collaboration capabilities that encourage reuse and minimize the
                          proliferation of data silos.

                        R Query the data warehouse. Some BI teams give permission to a handful of
                          trusted power users to directly query the data warehouse or DW staging
                          area. This requires that analysts have a deep understanding of the raw
                          data and advanced knowledge of SQL to avoid creating runaway queries or
                          generating incorrect results.





                        R Query Hadoop. If power users want to analyze big data in its raw or lightly
                          aggregated form, they can query Hadoop directly by writing MapReduce
                          code in a variety of languages. However, power users must know how to
                          write parallel queries and interrogate the structure of the data prior to
                          querying it, since Hadoop data is schema-less. Vendors are also beginning
                          to ship BI tools that access Hadoop through Hive or HBase and return data
                          sets to the BI tool.
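As a rough illustration of what writing MapReduce code against raw data involves, the sketch below simulates the map, shuffle and reduce phases in plain Python. It does not use Hadoop's own APIs. The log format, the sample lines and the function names are all assumptions made for this example. Because the data carries no schema, the mapper itself has to decide how to parse each line and what to do with lines that don't fit.

```python
from collections import defaultdict

# Hypothetical raw, schema-less log lines: "<timestamp> <user> <action>"
RAW_LINES = [
    "2011-09-01T10:00 alice view",
    "2011-09-01T10:01 bob purchase",
    "2011-09-01T10:02 alice purchase",
    "malformed line",
]

def map_phase(line):
    """Parse one raw line and emit (key, value) pairs; skip lines that don't fit."""
    parts = line.split()
    if len(parts) != 3:
        return []  # nothing enforces structure, so the mapper must check it
    _timestamp, _user, action = parts
    return [(action, 1)]

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate all values emitted for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of raw lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

action_counts = run_job(RAW_LINES)
```

In a real cluster the map and reduce functions would run in parallel on many nodes; here they run in one process so that only the programming model is visible.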

                    I Data integration. The new BI architecture also places a premium on managing
                    and manipulating data flows between systems. This calls for a versatile
                    set of data integration tools that can access any type of data (e.g., structured,
                    semi-structured and unstructured), load it into any target (e.g., Hadoop, data
                    warehouse or in-memory database), navigate data sources that exist on-premises
                    or in the cloud, work in batch and real time, and handle both small and large
                    volumes of data.
                       Data integration tools for Hadoop are in their infancy but evolving fast.
                    The open source community has developed Flume, a scalable distributed
                    service that collects, aggregates and loads data into HDFS. But longtime
                    data integration vendors, such as Informatica, are also converting their
                    visual design tools to interoperate with Hadoop. That way, ETL developers
                    can use familiar tools to extract, load, parse, integrate, cleanse and match
                    data in Hadoop by generating MapReduce code under the covers.
                       In this respect, Hadoop is both another data source for ETL tools and a
                    new data processing engine geared to handling semi-structured and unstructured
                    data. Data integration products that run on both relational and nonrelational
                    platforms and maintain a consistent set of metadata across both environments
                    will reduce overall training and maintenance costs.

                    I Analytical services. Although it doesn’t happen often, a growing number of
                    companies are outsourcing analytical applications, and occasionally an entire
                    data warehouse, to a third-party provider. The most popular applications to
                    outsource are test, development and prototyping applications.
                    Often, IT administrators will provision servers on demand from a public cloud
                    to support these types of applications. However, if an organization wants a
                    permanent online presence to support an analytical sandbox or data warehouse,
                    it will subscribe to a private hosted service, which provides higher levels
                    of guaranteed availability and performance compared with a public cloud.
                       In either case, the motivation to implement an analytical service is
                    straightforward: An analytical service requires minimal IT involvement and
                    up-front capital, making it quick and painless to get up and running.
                    Ironically, few companies think of outsourcing their analytical environments
                    when exploring options.

                    I Dollar General. That was the case with Dollar General, a discount retailer
                    that wanted to purchase an analytical system to supplement its Oracle data
                    warehouse, which could not store atomic-level point-of-sale (POS) data from
                    its 9,500 stores nationwide. With a reference from a consumer products partner,
                    Dollar General decided to implement an analytical platform from a services
                    provider, 1010data. The product, which is accessed via a Web browser, offers
                    an Excel-like interface that provides native support for time-series data and
                    analytical functions. Within five weeks, Dollar General was running daily
                    reports against atomic-level POS data, according to Sandy Steier, executive
                    vice president and co-founder of 1010data.
                       A year later, Dollar General decided to replace its Oracle data warehouse
                    and conducted a proof of concept with several leading analytical platform
                    providers. 1010data, which participated in the bake-off, demonstrated superior
                    performance and now runs Dollar General’s entire data warehouse.
                       And while Dollar General didn’t set out to purchase an analytical service,
                    doing so proved a smart move. Besides quick deployment times and reduced internal
                    maintenance costs, the analytical service made it easier for Dollar General
                    to open up its data warehouse to suppliers, which now use it to track sales and
                    make recommendations for product placement and promotions, Steier said.








                    Platforms for Running
                    Big Data Analytics


                    SINCE THE BEGINNING of the data warehousing movement in the early 1990s,
                    organizations have used general-purpose data management systems to implement
                    data warehouses and, occasionally, multidimensional databases (i.e.,
                    “cubes”) to support subject-specific data marts, especially for financial
                    analytics. General-purpose data management systems were designed for
                    transaction processing (i.e., rapid, secure, synchronized updates against
                    small data sets) and only later modified to handle analytical processing
                    (i.e., complex queries against large data sets). In contrast, analytical
                    platforms focus entirely on analytical processing at the expense of
                    transaction processing.1

                    I The analytical platform movement.
                    In 2002, Netezza (now owned by IBM) introduced a specialized analytical
                    appliance, a tightly integrated hardware-software database management system
                    designed explicitly to run ad hoc queries at blindingly fast speeds. Netezza’s
                    success spawned a host of competitors, and there are now more than two dozen
                    players in the market. The value of this new analytical technology didn’t
                    escape the notice of the world’s biggest software vendors, each of which has
                    made a major investment in the space, either through organic development or
                    an acquisition (see Table 1).
                       To be accurate, Netezza wasn’t the first mover in the market, but it came
                    along at the right time. In the mid-2000s, many corporate BI teams were



                    1
                    Like most things, there are exceptions to this rule. For example, Oracle Exadata runs on Oracle 10g and, as such, it supports both
                    transactional and analytical processing, often with superior performance in both realms compared with standard Oracle 10g
                    installations.








                    Table 1: Types of analytical platforms
                    (Companies in parentheses recently acquired the preceding product or company)

                    TECHNOLOGY              DESCRIPTION                               VENDOR/PRODUCT

                    Massively parallel      Row-based databases designed to           Teradata Active Data Warehouse,
                    processing analytic     scale out on a cluster of commodity       Greenplum (EMC), Microsoft
                    databases               servers and run complex queries in        Parallel Data Warehouse, Aster
                                            parallel against large volumes of         Data (Teradata), Kognitio, Dataupia
                                            data.

                    Columnar                Database management systems               ParAccel, Infobright, Sand
                    databases               that store data in columns, not rows,     Technology, Sybase IQ (SAP),
                                            and support high data compression         Vertica (Hewlett-Packard),
                                            ratios.                                   1010data, Exasol, Calpont

                    Analytical              Preconfigured hardware-software           Netezza (IBM), Teradata
                    appliances              systems designed for query                Appliances, Oracle Exadata,
                                            processing and analytics that             Greenplum Data Computing
                                            require little tuning.                    Appliance (EMC)

                    Analytical bundles      Predefined hardware and software          IBM SmartAnalytics, Microsoft
                                            configurations that are certified to      FastTrack
                                            meet specific performance criteria,
                                            but customers must purchase and
                                            configure themselves.

                    In-memory               Systems that load data into memory        SAP HANA, Cognos TM1 (IBM),
                    databases               to execute complex queries.               QlikView, Membase

                    Distributed             Distributed file systems designed         Hadoop (Apache, Cloudera, MapR,
                    file-based systems      for storing, indexing, manipulating       IBM, HortonWorks), Apache Hive,
                                            and querying large volumes of             Apache Pig
                                            unstructured and semi-structured
                                            data.

                    Analytical services     Analytical platforms delivered            1010data, Kognitio
                                            as hosted or public-cloud-based
                                            services.

                    Nonrelational           Nonrelational databases optimized         MarkLogic Server, MongoDB,
                                            for querying unstructured data as         Splunk, Attivio, Endeca, Apache
                                            well as structured data.                  Cassandra, Apache Hbase

                    CEP/streaming           Ingest, filter, calculate and correlate   IBM, Tibco, Streambase,
                    engines                 large volumes of discrete events and      Sybase (Aleri), Opalma, Vitria,
                                            apply rules that trigger alerts when      Informatica
                                            conditions are met.






                    struggling to deliver reasonable query performance using general-purpose
                    data management systems as the volumes of data and numbers of users and
                    applications running against their data warehouses exploded. Netezza offered
                    a convenient way to offload long-running or complex queries from data
                    warehouses and satisfy the needs of business analysts. Yet in the 1980s, long
                    before Netezza shipped its first box, Teradata delivered the first massively
                    parallel database management system geared to analytical processing, and Sybase
                    shipped the first columnar database in the 1990s. Both of these products now
                    have thousands of customers and, in this respect, can be considered
                    front-runners in the so-called analytical platform market.
                       Today, the technology behind analytical platforms is diverse: appliances,
                    columnar databases, in-memory databases, massively parallel processing
                    (MPP) databases, file-based systems, nonrelational databases and analytical
                    services. What they all have in common, however, is that they provide
                    significant improvements in price-performance, availability, load times and
                    manageability compared with general-purpose relational database management
                    systems. Every analytical platform customer I’ve interviewed has cited
                    order-of-magnitude performance gains that most initially don’t believe. “The
                    performance is blinding; just amazing,” says Masciandaro of Dow.

                    I Analytical techniques. Analytical platforms offer superior price-performance
                    for many reasons. And while product architectures vary considerably,
                    most support the following characteristics:

                        R MPP. Most analytical platforms spread data across multiple nodes, each
                          containing its own CPU, memory and storage and connected to a high-speed
                          backplane. When a user submits a query or runs an application, the
                          “shared nothing” system divides the work across the nodes, each of which
                          processes the query on its piece of the data and ships the results to a
                          master node that assembles the final result and sends it to the user. MPP
                          systems are highly scalable, since you simply add nodes to increase
                          processing power. And if the nodes run on commodity servers, as many MPP
                          systems today do, then this scalability is more cost-effective than MPP
                          systems running on proprietary hardware or symmetric multiprocessing
                          systems, which require big, expensive boxes to scale.

                        R Balanced configurations. Analytical platforms optimize the configuration
                          of CPU, memory and disk for query processing rather than transaction
                          processing. Analytical appliances essentially “hard wire” this configuration
                          into the system and don’t let customers change it, whereas analytical
                          bundles or analytical databases (i.e., software-only solutions) allow
                          customers to configure the underlying hardware to match unique application
                          requirements. Analytical appliances offer convenience and ease of use, while
                          analytical databases offer flexibility.


                        R Storage-level processing. Netezza’s big innovation was to move some
                          database functions, specifically data filtering functions, into the storage
                          system using field-programmable gate arrays. This storage-level filtering
                          reduces the amount of data that the DBMS has to process, which significantly
                          increases query performance. Many vendors have followed suit,
                          moving various database functions into hardware. In fact, Kickfire
                          (purchased by Teradata) runs all SQL functions on a chip to accelerate query
                          processing.

                        R Columnar storage and compression. Many vendors have followed the
                          lead of Sybase, Sand Technology, ParAccel and other columnar pioneers
                          by storing data in columns, not rows. Since most queries ask for a subset
                          of the columns in a table (i.e., those named in the “select” and “where”
                          clauses) rather than every column in each row, storing data in columns
                          minimizes the amount of data that needs to be retrieved from disk and
                          processed by the database, accelerating query performance.
                          In addition, since data elements in many columns are repeated (e.g.,
                          “male” and “female” in the gender field), column-store systems can
                          eliminate duplicates and compress data volumes significantly, sometimes as
                          much as 10:1. This enables more data to fit into memory, which speeds
                          processing, and reduces the amount of disk required to store data, making
                          the systems more cost-effective.

                        R Memory. Many analytical platforms make liberal use of memory caches
                          to speed query processing. Some products, such as SAP HANA and QlikTech’s
                          QlikView, store all data in memory, while others store recently
                          queried results in a smart cache so others who need to retrieve the same
                          data can pull it from memory rather than from disk. Given the growing
                          affordability of memory and the widespread deployment of 64-bit operating
                          systems, which lift constraints on the amount of data that can be held
                          in memory, many analytical platforms are expanding their memory footprints
                          to speed query processing.

                        R Query optimizer. Analytical platform vendors invest a lot of time and
                          money researching ways to enhance their query optimizers to handle various
                          workloads. A good query optimizer is perhaps the biggest contributor
                          to query performance. In this respect, the older vendors with established
                          products have an edge.

                        R Plug-in analytics. True to their name, many analytical platforms offer
                          built-in support for complex analytics. This includes complex SQL, such as
                          correlated subqueries, as well as procedural code implemented as plug-ins
                          to the database. Some vendors offer a library of analytical routines, from
                          fuzzy matching algorithms to market-basket calculations. Some, like Aster
                          Data (now owned by Teradata), provide native support for MapReduce
                          programs that are called using SQL.

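Two of the characteristics above, MPP and columnar storage, can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not any vendor's implementation: threads stand in for shared-nothing nodes, each node holds its partition as a dictionary of columns, and the column names and values are made up. The query is the equivalent of summing sales for one store. A small dictionary encoder shows why repeated values such as "male" and "female" compress well.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" owns one partition of the table, stored column by column.
# Column names and values are hypothetical.
NODES = [
    {"store": ["A", "B", "A"], "gender": ["male", "female", "female"], "sales": [10, 20, 30]},
    {"store": ["B", "A"], "gender": ["male", "male"], "sales": [5, 15]},
]

def dictionary_encode(column):
    """Replace repeated values with small integer codes (a simple column compression)."""
    dictionary = sorted(set(column))
    codes = [dictionary.index(value) for value in column]
    return dictionary, codes

def node_partial_sum(partition, filter_col, filter_val, measure_col):
    """Runs on one node: reads only the two columns the query needs."""
    flags = partition[filter_col]
    measures = partition[measure_col]
    return sum(m for f, m in zip(flags, measures) if f == filter_val)

def mpp_query(nodes, filter_col, filter_val, measure_col):
    """Master node: send the query to every node, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(
            lambda part: node_partial_sum(part, filter_col, filter_val, measure_col),
            nodes,
        ))
    return sum(partials)

total_sales_store_a = mpp_query(NODES, "store", "A", "sales")
gender_dictionary, gender_codes = dictionary_encode(NODES[0]["gender"])
```

Adding capacity in this model means appending another partition to the node list, which mirrors the scale-out argument made for MPP systems. The gender column is never read by the sales query, which mirrors the I/O savings claimed for column stores.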
                    I Hadoop and NoSQL. Some may question whether Hadoop and the nonrelational
                    databases are analytical platforms. While they don’t store data in rows and
                    columns, both are well suited to processing large volumes of data for analytical
                    purposes. Most use an MPP architecture that scales out on commodity
                    servers, and some, such as MarkLogic, are full-fledged databases that support
                    transactional integrity.
                       Hadoop in particular differs significantly from most analytical platforms. As
                    a batch system, it’s not focused on optimizing query performance like other
                    analytical platforms and thus does not implement many of the characteristics
                    in the above bulleted list. However, Hadoop’s biggest value is that it’s open
                    source and so can process large volumes of data in a cost-effective way. And
                    like many nonrelational systems, it is schema-less, giving administrators
                    greater flexibility to change data structures without having to spend weeks or
                    months rewriting a data model.
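The schema-less point can be illustrated with a small schema-on-read sketch. Records are stored exactly as they arrive, and structure is applied only when the data is read, so a new field can appear without any change to a predefined model. The sketch uses plain Python and JSON rather than Hadoop's APIs, and the record layout, field names and helper function are hypothetical.

```python
import json

# Raw records stored as-is; the last one adds a "channel" field
# without any change to a predefined data model.
RAW_RECORDS = [
    '{"customer": "c1", "amount": 25.0}',
    '{"customer": "c2", "amount": 40.0}',
    '{"customer": "c1", "amount": 10.0, "channel": "web"}',
]

def read_with_schema(raw_records, fields, defaults=None):
    """Apply a schema at read time: pick the requested fields, filling gaps with defaults."""
    defaults = defaults or {}
    for raw in raw_records:
        record = json.loads(raw)
        yield {name: record.get(name, defaults.get(name)) for name in fields}

rows = list(read_with_schema(RAW_RECORDS, ["customer", "channel"], {"channel": "unknown"}))
```

The trade-off is the one noted for querying Hadoop: whoever reads the data must understand its structure well enough to supply the field list and defaults.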





                    Profiling the Use of
                    Analytical Platforms


                    NOW THAT WE understand the business context for analytical platforms, the
                    technical architecture in which they run and their technical characteristics, we
                    can profile their use in user organizations. To do this, I conducted a survey of
                    BI professionals and asked them to describe their use of analytical platforms
                    from a business and technical perspective. The survey provided respondents
                    with the following definition of an analytical platform:

                        An analytical platform is a data management system optimized for
                        query processing and analytics that provides superior price-performance
                        and availability compared with general-purpose database management
                        systems.

                       Given this definition, almost three-quarters (72%) of our survey respondents
                    said that they had purchased or implemented an analytical database.
                       While the growth of the analytical platform market has been strong, this
                    72% figure seems a tad high, given that a majority of analytical database
                    products have been on the market for less than five years. Upon closer
                    investigation, despite our definition, a sizable number of respondents, when
                    asked to name their analytical platform, identified a general-purpose database,
                    in particular Microsoft SQL Server and Oracle (non-Exadata). Regardless, the data
                    still shows that many companies are turning to specialized analytical platforms
                    to better meet their analytical requirements.

                    I Non-customers. Among respondents that haven’t purchased an analytical
                    platform, 46% have no plans to do so, 42% are exploring the idea and just
                    12% are currently evaluating vendors. On the whole, about 75% of respondents
                    will have an analytical platform in the near future (see Figure 10, page
                    33).
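The “about 75%” figure appears to combine the two survey results above. One plausible reading, which is an assumption rather than something the survey states, is that it adds the non-customers currently evaluating vendors to the 72% of respondents that already have a platform. The variable names are illustrative only.

```python
# Figures quoted in the text; how they combine is an assumption.
have_platform = 0.72          # share of all respondents with a platform
currently_evaluating = 0.12   # share of the remaining respondents evaluating vendors

non_customers = 1 - have_platform
near_future_share = have_platform + non_customers * currently_evaluating
near_future_percent = round(near_future_share * 100, 1)
```

Under this reading the result is 72% plus 12% of the remaining 28%, or roughly 75.4%, which is consistent with the rounded figure in the text.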






                    Figure 10: Do you plan to purchase or implement an analytical platform?
                               (Asked of respondents who don’t yet have an analytical platform)
                               [Bar chart, percentage of respondents: No plans; Exploring;
                               Currently evaluating]


                    DEPLOYMENT OPTIONS
                    Our survey grouped analytical platforms into four major categories to make it
                    easier to compare and contrast various product offerings:

                    1. Analytical databases: These are software-only analytical platforms that
                    run on a variety of hardware that customers purchase. Customers install,
                    configure and tune software, including the analytical database, before they
                    can use the analytical system. Most MPP analytical databases, columnar
                    databases and in-memory databases listed in Table 1 qualify as analytical
                    databases.

                    2. Analytical appliances: These are hardware-software combinations
                    designed to support ad hoc queries and other types of analytical processing.
                    Analytical appliances tightly integrate the hardware and software, often using
                    proprietary components, to optimize performance and minimize the need for
                    tuning. Analytical bundles, which consist of standalone hardware and software
                    products that a vendor ships as a package, also qualify as analytical
                    appliances. Bundles give administrators more flexibility to tune the system but
                    sacrifice deployment speed and manageability.

                    3. Analytical services: Rather than deploy an analytical platform in a
                    customer’s data center, an analytical service enables customers to house the
                    system in an off-site hosted environment or public cloud. This eliminates
                    up-front capital expenditures and lessens maintenance.

                    4. File-based analytical systems: This generally refers to Hadoop, but we also
                    lumped NoSQL or nonrelational systems into this category, although it’s not
                    entirely accurate, since nonrelational systems are databases. However, since
                    both are used to store and analyze large volumes of unstructured data and
                    don’t require an up-front schema design, they share more similarities than
                    differences.

                    Figure 11: Analytical platform deployment options:
                               Which have you purchased and implemented?
                               [Bar chart, percentage of respondents: Analytical appliance;
                               Analytical database; File-based analytical system; Analytical service]

                       Given these categories, most analytical platform customers have either
                    purchased or implemented analytical databases (46%) or analytical appliances
                    (49%). Far fewer have implemented a file-based analytical system (10%)
                    or analytical service (5%). (See Figure 11.)
                       Looking under the covers, analytical database customers are most likely to
                    have purchased Microsoft SQL Server or Oracle, while appliance customers
                    have purchased Teradata Active DW, a Teradata Appliance or Netezza.
                    Analytical services customers subscribed to a host of different vendors, while
                    customers of file-based analytical systems were most likely to purchase a Hadoop
                    distribution from Cloudera, Apache or EMC Greenplum.

                    DEPLOYMENT STATUS
                    Drilling into each category further, we find that most of the respondents who
                    have purchased an analytical platform of some type have also deployed the
                    system. Roughly three-quarters of customers with analytical databases (73%)
                    and slightly more customers of analytical appliances (80%) have deployed
                    their systems. Not surprisingly, 100% of analytical services customers have
                    deployed their systems, but only 33% of customers with file-based analytical
                    systems have implemented theirs (see Table 2, page 35).






                    Table 2: Status of deployment options

                                                ANALYTICAL    ANALYTICAL    ANALYTICAL    FILE-BASED
                                                DATABASE      APPLIANCE     SERVICE       ANALYTICAL SYSTEM

                    Percentage deployed         72%           81%           100%          33%
                    Average years deployed      4.0           4.9           3             1.3




                    With an analytical service, you simply create a data model (which in many
                    cases is optional) and load your data either by using the Internet or shipping a
                    disk to the provider, and the provider takes care of the rest. Thus, it’s much
                    easier to deploy an analytical service than the other options, accounting for
                    the 100% deployment figure in Table 2.
                       Analytical appliances generally take less time to deploy than analytical
                    databases, which may account for the slightly higher deployment percentage.
                    Analytical databases require customers to purchase and install hardware, which
                    may take many months and require multiple sign-offs from the IT, legal and
                    purchasing departments. Despite overwhelming press coverage of Hadoop,
                    few companies have implemented the system. Among those that have, most
                    are largely experimenting, which explains the low deployment percentage
                    compared with the other options.
                       The figures for “average years deployed” tell a similar story. As the new kid
                    on the block, Hadoop systems have only been deployed for an average of 1.3
                    years, followed by analytical services, which have been deployed an average
                    of three years. In contrast, analytical appliances have been deployed for 4.9
                    years and analytical databases for 4.0 years.

                    TECHNICAL DRIVERS
                    When examining the business requirements driving purchases of analytical
                    platforms overall, three percolate to the top: “faster queries,” “storing more
                    data” and “reduced costs.” These requirements are followed by “more complex
                    queries,” “higher availability” and “quicker to deploy.” This ranking is
                    based on summing the percentages of all four deployment options for each
                    requirement (see Figure 12, page 36).
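
The ranking method described above (sum each requirement's percentage across the four deployment options, then sort) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the survey's actual tooling. The function name `rank_by_total`, the requirement labels and every number in `example` are hypothetical placeholders, not survey results.

```python
from typing import Dict, List, Tuple


def rank_by_total(pct: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank requirements by the sum of their percentages across deployment options."""
    totals = {req: sum(by_option.values()) for req, by_option in pct.items()}
    # Highest combined percentage first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical placeholder values for illustration only. These are NOT survey data.
example = {
    "requirement_a": {"database": 30, "appliance": 70, "service": 40, "file_based": 20},
    "requirement_b": {"database": 45, "appliance": 35, "service": 65, "file_based": 30},
    "requirement_c": {"database": 25, "appliance": 20, "service": 15, "file_based": 60},
}

ranking = rank_by_total(example)
for requirement, total in ranking:
    print(f"{requirement}: {total}")
```

One consequence of this method is that each deployment option carries equal weight in the total, regardless of how many respondents fall into that category.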
                    Figure 12: Business requirements by deployment option
                    (Sorted from most to least for the percentage total of all four deployment options)
                    [Bar chart, 0 to 100 percent, comparing analytical database, analytical
                    appliance, analytical service and file-based analytical system. Requirements
                    shown: faster queries, stores more data, reduced costs, more complex queries,
                    higher availability, quicker to deploy, easier maintenance, faster load times,
                    more diverse data, more flexible schema, more concurrent users, built-in
                    analytics.]

                       More important, Figure 12 shows that customers purchase each deployment
                    option for different reasons. Analytical database customers value “quicker to
                    deploy” (46%), “built-in analytics” (43%) and “easier maintenance” (41%)
                    more than other requirements, while analytical service customers favor “storing
                    more data” (67%), “higher availability” (67%), “reduced costs” (56%) and
                    “more concurrent users” (56%). Not surprisingly, customers with file-based
                    systems look for the ability to support “more diverse data” (64%) and “more
                    flexible schemas” (64%), two hallmarks of a Hadoop/NoSQL offering.
                       Analytical appliance customers had the most emphatic requirements. Roughly
                    two-thirds or more value faster queries (70%), more complex queries (64%)
                    and faster load times (63%), suggesting that analytical appliance customers
                    seek to offload complex ad hoc queries from data warehouses.
                       This is exactly the reason that Blue Cross Blue Shield of Kansas City
                    purchased a Teradata Data Warehouse Appliance 2650 and MicroStrategy analysis
                    tools. The company plans to push about 1 TB of data from its IBM DB2 data
                    warehouse to the Teradata appliance to support a self-service BI environment for
                    about 300 users. “We want executives and managers to be able to get data
                    and make decisions without depending on the IT department,” says Darren
                    Taylor, vice president of enterprise analytics and data management. Taylor
                    said the key requirements for the system were query performance and the
                    ability to support complex analytical models and advanced visualization
                    techniques, which will be embedded in the self-service solution.
                       Many companies also offload analytical processing to analytical databases.
                    For example, a large U.S. retailer recently offloaded complex analytical queries
                    from its maxed-out Teradata data warehouse to ParAccel, a high-performance
                    columnar database. The company chose a software-only system so it could
                    implement the database in a private cloud and spawn new instances in
                    response to user demand, a key requirement that an analytical appliance does
                    not support. The customer also implemented a direct connection between the
                    two systems using Teradata’s parallel FastExport wire protocol, which
                    eliminated the need to expand its ETL footprint and saved considerable time
                    and money.
                       “Agility and interoperability with existing technologies were key drivers for
                    the customer,” said Rick Glick, vice president of customer and partner
                    development at ParAccel.

                    Figure 13: Were you explicitly looking for [this deployment option]?
                    (Percentages based on respondents who answered “Yes”)
                    [Bar chart, 0 to 60 percent, with one bar each for analytical database,
                    analytical appliance, analytical service and file-based analytical system.]

                    I Selection by category. We also asked respondents if they were looking for a
                    specific deployment option when evaluating products (see Figure 13). Except
                    for customers of file-based systems, most customers investigated products
                    irrespective of their technology category. For example, Blue Cross Blue Shield of
                    Kansas City looked at three columnar databases (i.e., software-only) and an
                    appliance before making a decision. Interestingly, no analytical service
                    customers intended to subscribe to a service prior to evaluating products. That’s
                    because many analytical service customers subscribe to such services on a
                    temporary basis, either to test or prototype a system or to wait until the IT
                    department readies the hardware to house the system. Some of these
                    customers continue with the services, recognizing that they provide a more
                    cost-effective test and development environment than an in-house system.

                    BUSINESS APPLICATIONS
                    When push comes to shove, the value of an analytical platform is judged not
                    by its technical merits, but by the business applications it supports or makes
                    possible. The most popular business application running on analytical
                    platforms is customer analytics, followed by management reports, financial
                    analytics, data integration, executive dashboards and risk analytics. This ranking is
                    based on summing the percentages of all four deployment options for each
                    application (see Figure 14).


                    Figure 14: Business applications by deployment option
                    (Sorted from most to least for the percentage total of all four deployment options)
                    [Bar chart, 0 to 100 percent, comparing analytical database, analytical
                    appliance, analytical service and file-based analytical system. Applications
                    shown: customer analytics, management reports, financial analytics, data
                    integration, executive dashboards, risk analytics, Web traffic analytics,
                    supply chain analytics, logistics analytics, cross-sell, social media analytics.]

                    Figure 15: ROI by deployment option
                    [Bar chart, 0 to 50 percent, with one bar each for analytical database,
                    analytical appliance, analytical service and file-based analytical system.]

                       Figure 14 also exposes stark differences in the business applications
                    supported by each deployment option. For example, an analytical appliance is
                    more likely to be used for customer analytics (80%), risk analytics (40%)
                    and cross-sell recommendations (29%) than analytical databases, which are
                    more likely to be used for management reports (82%) and executive
                    dashboards (60%). Thus, analytical databases are more likely to be used for
                    traditional top-down reporting, while analytical appliances are used for bottom-up
                    analytics. This contrast makes sense when you remember that many of our
                    analytical database users are customers of Microsoft SQL Server and Oracle
                    10, which are best suited to reporting, not analytics. Figure 14 also shows that
                    file-based systems are twice as likely as the other options to be used for
                    Web traffic analysis (46%) and social media analysis (15%).

                    I ROI. Not surprisingly, given their emphasis on analytics versus reporting,
                    analytical appliances (35%) have a higher ROI than analytical databases (26%),
                    with analytical services close behind at 33%. Given their newness, file-based
                    systems delivered a surprisingly strong 25% ROI, but that’s probably because
                    most file-based systems are open source and don’t require an up-front
                    investment in software (see Figure 15).


                    TECHNICAL ATTRIBUTES

                    I Applications and users. When examining business attributes of each
                    deployment option, it’s clear that analytical appliances support far more
                    applications and users than analytical services, analytical databases or
                    file-based systems. The analytical appliance figure is perhaps skewed by the high
                    number of Teradata Active DW customers who responded to the survey.
                    Teradata Active EDW is geared to supporting multiple workloads, serving as a data
                    warehouse, data mart and operational data store. In addition, its customers
                    have used the product for many years, and the longer a product is used, the
                    more applications it tends to support (see Table 3).

                    Table 3: Applications and users
                                                 ANALYTICAL       ANALYTICAL       ANALYTICAL         FILE-BASED
                                                  DATABASE        APPLIANCE         SERVICE        ANALYTICAL SYSTEM

                    Average number
                    of applications                 5.9              11.3              8.0                4.9

                    Average number
                    of concurrent users             47.1             81.4              27.5              27.8

                    I Data volumes. Analytical appliances and file-based systems are neck and
                    neck in terms of the amount of data they store. More than 40% of both sets of
                    customers use the systems to store between 10 TB and 100 TB of data, and
                    more than 14% of both options store over 100 TB. In contrast, 40% of
                    analytical database customers have less than 1 TB of data (see Figure 16, page 42).

                    I Types of data. Not surprisingly, the analytical database and analytical
                    appliance, both of which rely on relational technology, primarily hold structured
                    data (90% and 95% respectively). In contrast, analytical services and
                    file-based analytical systems hold a more balanced mix of data types. More than
                    three-quarters (78%) of analytical services customers manage structured data,
                    while 67% manage semi-structured data and 33% unstructured data.
                    File-based systems, by comparison, have more semi-structured data (73%) than
                    either structured (67%) or unstructured (33%). This high percentage of
                    semi-structured data reflects the trend of companies insourcing Web data from
                    service bureaus so they can combine it with other corporate data, such as
                    sales and orders, and derive more value from it (see Figure 17).


                    Figure 16: Volume of raw data
                    [Bar chart, 0 to 100 percent, comparing analytical database, analytical
                    appliance, analytical service and file-based analytical system. Volume bands
                    shown: less than 100 GB, 100 GB to 1 TB, 1 TB to 5 TB, 5 TB to 10 TB,
                    10 TB to 100 TB, 100 TB+.]

                    Figure 17: Types of data
                    [Bar chart, 0 to 100 percent, comparing analytical database, analytical
                    appliance, analytical service and file-based analytical system. Data types
                    shown: structured, semi-structured, unstructured.]



                    ARCHITECTURE
                    Architectural roles are fairly consistent across deployment options. The
                    noticeable exception is that analytical appliances and analytical services are
                    much more likely to be used for data warehouses than the other options. Also,
                    analytical services and file-based systems are more likely to be used for
                    prototyping than other systems, and analytical databases are more likely to be used
                    as independent data marts (see Figure 18).

                    Figure 18: Architecture by deployment option
                    [Bar chart, 0 to 100 percent, comparing analytical database, analytical
                    appliance, analytical service and file-based analytical system. Roles shown:
                    staging area, data warehouse, dependent data mart, independent data mart,
                    development/test, prototyping.]

                       In addition, the most prominent use of file-based systems is for
                    “prototyping” (44%), followed by “staging area” (38%) and “data warehouse” (38%).
                    Currently, Hadoop is in its early days, and many companies are experimenting
                    with the technology, which explains the high percentage of prototyping
                    applications. Hadoop is also often used to stage and process Web traffic so
                    companies can summarize and transfer the data into the data warehouse for analysis.
                    Some companies, however, aren’t moving this data into data warehouses; they are
                    simply leaving it in Hadoop and allowing data scientists to query this “data
                    warehouse” of semi-structured or unstructured data.
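
To make the staging pattern described above more concrete, here is a minimal Python sketch of summarizing raw Web traffic into compact rows that could then be loaded into a data warehouse. It is an illustration only and does not use Hadoop. The tab-separated log layout (timestamp, user, path), the function name `summarize_page_views` and the sample records are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_page_views(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Count page views per (date, path) from raw tab-separated log lines.

    Assumed (hypothetical) line layout: ISO timestamp, user id, request path.
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records instead of failing the whole batch
        timestamp, _user, path = parts
        date = timestamp[:10]  # keep only the YYYY-MM-DD portion
        counts[(date, path)] += 1
    return dict(counts)


# Hypothetical sample of raw traffic sitting in a staging area.
raw_log = [
    "2011-09-01T10:00:00\tu1\t/home",
    "2011-09-01T10:05:00\tu2\t/home",
    "2011-09-01T11:00:00\tu1\t/cart",
    "garbled line",
    "2011-09-02T09:00:00\tu3\t/home",
]

summary = summarize_page_views(raw_log)
for (date, path), views in sorted(summary.items()):
    print(date, path, views)
```

The summarized rows are far smaller than the raw log, which is the reason for summarizing before transfer. Companies that leave the data in place would instead query the raw records directly.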
                       I’m surprised by the 79% of analytical appliance customers who are using
                    the system as a data warehouse. I believe this reflects the large percentage of
                    Teradata Active DW customers who took the survey. But it’s not dissimilar
                    from 2010 survey results that showed that 68% of companies were using
                    analytical platforms in general to power their data warehouses. What I’ve
                    discovered anecdotally is that companies that power their data warehouse with
                    Microsoft SQL Server are more likely to replace the product with an analytical
                    platform than augment it. Conversely, companies with scalable data
                    warehouse RDBMSs are more likely to augment the RDBMS with an analytical
                    platform than replace it.
                       I’m also surprised that analytical databases are the leading platform for
                    dependent and independent data marts, with 47% and 45% of customers
                    selecting these architectural roles respectively. This undoubtedly reflects the
                    large number of Microsoft SQL Server customers who took the survey, but it’s
                    also not too far out of line with our 2010 survey results.

                    TECHNICAL REQUIREMENTS
                    The technical requirements for selecting products varied widely by deployment
                    option. Figure 19 (see page 45) ranks the requirements by sum of the
                    percentages across all four deployment options. This shows that the top technical
                    requirements are “supports our preferred ETL/BI tools,” “automated
                    distribution of data” and “use of commodity servers.” These are followed by “MPP,”
                    “built-in fast loading,” “supports unstructured data,” “supports our preferred
                    operating system” and “mixed workload.”
                    Figure 19: Technical requirements
                    [Bar chart, 0 to 100 percent, comparing analytical database, analytical
                    appliance, analytical service and file-based analytical system. Requirements
                    shown: supports preferred ETL/BI tools, automated distribution of data, use of
                    commodity servers, MPP, built-in fast loading utilities, supports unstructured
                    data, supports our preferred operating system, mixed workload, open source,
                    supports MapReduce, supports our preferred hardware.]

                       What’s striking is the variation in support for these requirements by
                    deployment option. For example, interoperability with existing BI and ETL tools is a
                    critical requirement for all options except the file-based system. This makes
                    sense, since most Hadoop developers write custom code in Java, Perl or
                    some other language to construct queries rather than use packaged BI tools.
                    However, BI and ETL vendors are extending their products to interoperate with
                    Hadoop, so this will undoubtedly change, since it’s often easier to use tools
                    than write code.
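
To illustrate what “writing custom code to construct queries” involves, here is a small sketch of the MapReduce pattern in plain Python. It does not use the Hadoop API, and the record fields (`channel`, `amount`) and function names are hypothetical. It computes what a packaged BI or SQL tool would express as `SELECT channel, SUM(amount) ... GROUP BY channel`, but with the map, shuffle and reduce steps written out by hand.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_record(line: str) -> Iterator[Tuple[str, float]]:
    """Map step: emit (channel, amount) for records that carry both fields."""
    record = json.loads(line)
    # Semi-structured input: fields may be missing, so check before emitting.
    if "channel" in record and "amount" in record:
        yield record["channel"], float(record["amount"])


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key: str, values: List[float]) -> Tuple[str, float]:
    """Reduce step: collapse each group to a single total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, float]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_group(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input records, one JSON document per line.
events = [
    '{"channel": "web", "amount": 10.0}',
    '{"channel": "store", "amount": 5.5}',
    '{"channel": "web", "amount": 2.5, "coupon": "X1"}',
    '{"channel": "web"}',
]

totals = run_job(events)
print(totals)
```

Even this toy version takes several functions to express a one-line query, which helps explain why tool support for Hadoop is attractive.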
                       Another variation is that file-based customers are much more interested in
                    “commodity servers,” “open source” and “MapReduce” than customers of
                    other deployment options. This makes sense, since all three requirements are
                    critical elements of a Hadoop ecosystem. In contrast, analytical appliance
                    customers are concerned with MPP, fast-loading utilities and mixed workload
                    functionality. This aligns with the predominant needs of Teradata Active DW
                    customers, who constituted a large portion of the analytical appliance
                    respondents.

                    VENDORS
                    When asked why they selected their chosen vendors, respondents were
                    most likely to say, “Met more of our requirements,” followed by “successful
                    proof of concept,” “liked and trusted the vendor” and “support and service.”
                    Interestingly, pricing was ranked fifth on the list, followed by customer
                    references and vendor incumbency (see Figure 20, page 47).
                       A vendor’s ability to meet more of a customer’s requirements was more
                    important for analytical appliance customers than customers of other
                    deployment options. This is for two reasons: (1) many analytical appliances are new
                    products from startup vendors (except Teradata Active DW), so customers
                    need to make doubly sure that the product meets their requirements, since
                    these vendors have less of a track record, and (2) the customer is spending a
                    significant amount of money on the product, and it’s playing a central role in
                    the data warehousing architecture. (This is also true for Teradata Active DW
                    customers.)
                       It came as no surprise that support and service are more critical for
                    customers of analytical services, who are totally dependent on the quality and
                    responsiveness of the vendor to meet their needs. Pricing is also clearly a
                    bigger issue for analytical service customers than others, since price (or more
                    likely lack of an up-front capital investment) is a major inducement to hand
                    over responsibility for corporate data to a third party.

                    I Incumbency. Interestingly, incumbency can cut both ways. Blue Cross Blue
                    Shield of Kansas City decided not to examine an appliance product from its
                    incumbent data warehouse vendor because it didn’t want to expand its
                    relationship with that vendor. At the same time, it considered a columnar
                    database from Sybase because it had an existing relationship with the vendor in
                    another area of the business.

                    Figure 20: Vendor selection criteria
                    [Bar chart, horizontal axis 0 to 100. Criteria shown: Met more of our
                    requirements; Successful POC; Liked and trusted vendor; Support and service;
                    Pricing; Customer references; Incumbent vendor; Pre-sales and sales process;
                    Other. Series: Analytical database; Analytical appliance; Analytical service;
                    File-based analytical system.]







                    Recommendations
                    TO ADDRESS THE information needs of the modern corporation, organizations
                    should take the following steps:

                    1. Support both top-down and bottom-up business requirements. For too
                    long, organizations have tried to shoehorn all types of users into a single
                    information architecture. That has never worked. Organizations need to
                    recognize that casual users, who represent a majority of their business users,
                    primarily need top-down, interactive reports and dashboards, while power
                    users need ad hoc exploratory tools and environments. Balancing these
                    polar-opposite requirements in a single architecture requires new thinking.




                    2. Implement a new BI architecture. The BI architecture of the future
                    incorporates traditional data warehousing technologies to handle detailed
                    transactional data and file-based and nonrelational systems to handle
                    unstructured and semi-structured data. The key is to integrate these systems
                    into a unified architecture that enables casual and power users to query,
                    report on and analyze any type of data in a relatively seamless manner. This
                    unified information access is the hallmark of the next-generation BI
                    architecture. More immediately, companies are using Hadoop to preprocess
                    unstructured data so that it can be loaded and integrated with other corporate
                    data for reporting and analysis. This allows BI and ETL users to use familiar
                    tools to query and analyze data.
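
The preprocessing step described above can be sketched in a few lines. The example below is a minimal illustration in plain Python, not an actual Hadoop job: the log layout, the column names and the functions parse_line, preprocess and to_csv are all assumptions made for this sketch. It parses semi-structured Web log lines, counts page views per customer and page, and emits ordinary rows that an ETL tool could load alongside other corporate data.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical log layout (an assumption for illustration only):
#   <ISO timestamp> <customer id> <HTTP method> <URL path>
LOG_PATTERN = re.compile(r"^(\S+) (\S+) (GET|POST) (\S+)$")


def parse_line(line):
    """Map step: turn one raw log line into a (customer_id, path) pair, or None."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # malformed lines are skipped rather than loaded
    _timestamp, customer_id, _method, path = match.groups()
    return customer_id, path


def preprocess(lines):
    """Reduce step: count page views per (customer_id, path)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed] += 1
    # Sorting gives a stable, load-ready result.
    return [(cust, path, n) for (cust, path), n in sorted(counts.items())]


def to_csv(rows):
    """Serialize rows as CSV text that a bulk loader or ETL tool could ingest."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["customer_id", "path", "page_views"])
    writer.writerows(rows)
    return buffer.getvalue()


raw_log = [
    "2011-08-01T10:00:00 c42 GET /pricing",
    "2011-08-01T10:00:05 c42 GET /pricing",
    "2011-08-01T10:01:00 c7 POST /checkout",
    "corrupted line without the expected fields",
]
rows = preprocess(raw_log)
csv_text = to_csv(rows)
print(csv_text)
```

On a real cluster the same parse-then-aggregate logic would be distributed across many map and reduce tasks; the point of the sketch is only that the output is plain rows and columns that familiar BI and ETL tools can consume.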




                    3. Create analytical sandboxes. The new BI architecture brings power
                    users more fully into the corporate information architecture by creating
                    analytical sandboxes that enable them to mix personal and corporate
                    data and run complex, ad hoc queries with minimal restrictions. Types of
                    analytical sandboxes include (1) virtual sandboxes running as partitions
                    within a data warehouse, (2) a free-standing data mart running a replica of
                    the data warehouse or other data not available in the data warehouse and
                    powered by an analytical platform or nonrelational database, (3) an in-memory
                    BI tool that runs on an analyst’s desktop or a corporate server and (4) a
                    Hadoop cluster that stores atomic-level unstructured or semi-structured data.
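
To make the idea of mixing personal and corporate data concrete, here is a small hedged sketch. It uses an in-memory SQLite database from Python's standard library as a stand-in for a sandbox; the tables corp_sales and my_targets, their columns and all values are hypothetical and exist only for illustration. A production sandbox would run on one of the four platform types listed above.

```python
import sqlite3

# An in-memory SQLite database stands in for the sandbox (an assumption).
sandbox = sqlite3.connect(":memory:")

# Corporate data: a hypothetical extract replicated from the data warehouse.
sandbox.execute("CREATE TABLE corp_sales (region TEXT, product TEXT, revenue REAL)")
sandbox.executemany(
    "INSERT INTO corp_sales VALUES (?, ?, ?)",
    [
        ("East", "widgets", 120.0),
        ("East", "gadgets", 80.0),
        ("West", "widgets", 200.0),
    ],
)

# Personal data: for example, targets an analyst keeps in a spreadsheet.
sandbox.execute("CREATE TABLE my_targets (region TEXT, target REAL)")
sandbox.executemany(
    "INSERT INTO my_targets VALUES (?, ?)",
    [("East", 250.0), ("West", 150.0)],
)

# Ad hoc query mixing both sources: revenue versus personal target by region.
query = """
    SELECT s.region,
           SUM(s.revenue) AS revenue,
           t.target,
           SUM(s.revenue) - t.target AS gap
    FROM corp_sales AS s
    JOIN my_targets AS t ON t.region = s.region
    GROUP BY s.region, t.target
    ORDER BY s.region
"""
results = sandbox.execute(query).fetchall()
for row in results:
    print(row)
```

Because the personal table lives only inside the sandbox, the analyst can experiment freely without altering the governed warehouse tables.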




                    4. Implement analytical platforms that meet business and technical
                    requirements. Today, organizations implement analytical platforms for
                    various reasons. For example, analytical appliances are fast to deploy
                    and easy to maintain. They make good replacements for Microsoft SQL Server
                    or Oracle data warehouses that have run out of gas, and they are ideal as
                    free-standing data marts that offload complex queries from large, maxed-out
                    data warehousing hubs. Analytical databases, as software-only solutions, run
                    on a variety of hardware platforms and are good for organizations that want to
                    tune database performance for specific workloads or run the RDBMS software on
                    a virtualized private cloud. Analytical services are great for development,
                    test and prototyping applications, as well as for organizations that don’t
                    have an IT department or that want to outsource data center operations or get
                    up and running very quickly. File-based analytical systems and nonrelational
                    databases are ideal for processing large volumes of Web traffic and other
                    log-based or machine-generated data. Organizations need to carefully evaluate
                    the type and capabilities of the analytical platform they need before
                    purchasing and deploying a system.
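
The matching of requirements to platform types can be illustrated with a small lookup. This is only a sketch: the strengths are paraphrased from the paragraph above, while the requirement labels and the shortlist function are hypothetical shorthand introduced here, not categories from the survey. A real evaluation would weigh many more factors, such as pricing, proofs of concept and vendor support.

```python
# Strengths per platform type, paraphrased from the recommendation above.
# The requirement labels are hypothetical shorthand, not survey categories.
PLATFORM_STRENGTHS = {
    "analytical appliance": {
        "fast deployment", "easy maintenance", "offload complex queries",
    },
    "analytical database": {
        "workload tuning", "hardware choice", "private cloud",
    },
    "analytical service": {
        "no IT department", "outsourced operations", "prototyping", "fast start",
    },
    "file-based analytical system": {
        "web traffic", "log data", "machine-generated data",
    },
}


def shortlist(requirements):
    """Rank platform types by how many of the stated requirements they cover."""
    wanted = set(requirements)
    scores = {
        name: len(wanted & strengths)
        for name, strengths in PLATFORM_STRENGTHS.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(name, score) for name, score in ranked if score > 0]


candidates = shortlist(["log data", "web traffic", "fast start"])
print(candidates)
```

A simple count like this is only a first screen for deciding which platform types deserve a closer look.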
                    ABOUT THE AUTHOR
                    Wayne Eckerson has been a thought leader in the data warehousing, business
                    intelligence and performance management fields since 1995. He has conducted
                    numerous in-depth research studies and is the author of the best-selling book
                    Performance Dashboards: Measuring, Monitoring, and Managing Your Business. He
                    is a noted keynote speaker and blogger and he consults and conducts workshops
                    on business analytics, performance dashboards and business intelligence (BI),
                    among other topics. For many years, Eckerson served as director of education
                    and research at The Data Warehousing Institute (TDWI), where he oversaw the
                    company’s content and training programs and chaired its BI Executive Summit.
                       Eckerson is currently director of research at TechTarget, where he writes a
                    popular weekly blog called Wayne’s World, which focuses on industry trends and
                    examines best practices in the application of business intelligence. (See
                    www.b-eye-network.com/blogs/eckerson.) Wayne is also president of BI Leader
                    Consulting (www.bileader.com) and founder of BI Leadership Forum
                    (www.bileadership.com), a network of BI directors who meet regularly to
                    exchange ideas about best practices in BI and educate the larger BI community.
                    He can be reached at weckerson@techtarget.com.



RESOURCES FROM OUR SPONSOR




• The Business Case for Transparent Databases, by Sandy Steier (White Paper)

• BIG DATA Analytics, This time it’s personal, by Robin Bloor (White Paper)

• AutoZone’s Cloud BI Initiative, The Endless Possibilities of Cloud BI (Case Study)



About 1010data:
1010data is the first interactive cloud-based platform for Big Data analytics. The company’s
namesake service provides the most powerful, usable and scalable solution available today for
investigative and predictive analytics. It does this by combining ultra-fast database technology
with a rich and sophisticated array of built-in analytical functions and an intuitive worksheet
user interface, and delivers them as a managed service that offers the fastest time to value.
And, with 1010data, there is no need for complex, time-consuming data design, integration, or
transformation steps. The 1010data cloud also hosts and enables access to a growing number
of large proprietary and public data sets, including ones for credit reporting, mortgage-backed
securities, real estate, labor statistics, and more. 1010data is used by hedge funds, global
banks, large securities exchanges, top retailers, leading consumer packaged goods companies,
and many other industry leaders to manage, manipulate and monetize trillions of business
data records every day.
RESOURCES FROM OUR SPONSOR




• LiveRail Implements Infobright and Hadoop for Video Advertising

• What’s Cool About Columns...and How to Extend Their Benefits

• Infobright Analytic Database: Architecture Overview



About Infobright:
Infobright develops and markets a high performance, self-tuning analytic database designed
for applications and data marts that analyze Big Data, especially “machine-generated data”
such as web data, network logs, telecom records, stock tick data and sensor data. Easy to
implement and with unmatched data compression, operational simplicity and low cost,
Infobright is being used by enterprises, SaaS and software companies in online businesses,
telecommunications, financial services and other industries to provide rapid analysis of critical
business data.
RESOURCES FROM OUR SPONSOR




• Big Data Unleashed: Turning Big Data into Big Opportunities White Paper

• ESG Report: Informatica 9.1 and Integrating Big Data

• Integrating Social Media into MDM Demo



About Informatica:
Informatica is the world’s number one independent provider of data integration software.
Thousands of organizations turn to Informatica to gain a competitive advantage in
today's global information economy with timely, relevant and trustworthy data for their top
business imperatives. Enterprises rely on Informatica Data Integration and Data Quality
solutions to gain a competitive advantage from their information assets to grow revenues,
increase profitability, further regulatory compliance and foster customer loyalty. The
Informatica Platform provides a comprehensive, unified and open approach to lower IT costs
and gain competitive advantage from data held in the traditional enterprise and in the
Internet cloud.
RESOURCES FROM OUR SPONSOR




• The Post-Relational Reality Sets In: 2011 Survey on Unstructured Data

• Leading Analyst Predicts Big Changes from Big Data: Exclusive Interview Recording

• Addressing the Challenges of Unstructured Information with Purpose-built Technology



About MarkLogic:
MarkLogic empowers organizations to make high-stakes decisions on Big Data in real time.
Customers trust MarkLogic for mission-critical applications that drive revenue and growth
through Big Data Analytics enabled by MarkLogic products, services, and partners. MarkLogic
is a fast-growing enterprise software company that has been providing solutions to the public
sector and Global 1000 for nearly a decade. Operating at petabyte scale, MarkLogic Server is a
next-generation database for unstructured information that allows customers to outflank their
competition by consistently getting to better decisions faster.

MarkLogic is headquartered in Silicon Valley with field offices in Austin, Boston, Frankfurt,
London, Tokyo, New York, and Washington D.C.
RESOURCES FROM OUR SPONSOR




• ParAccel Analytic Platform Datasheet

• ParAccel Video: Trends and Architectures for Modern Analytics

• ParAccel Deployment Options: On-Premise and in the Cloud



About ParAccel:
ParAccel's analytic platform is built from the ground up to provide the highest performance for
the widest variety of analytic workloads. It includes 500+ advanced in-database functions
along with dynamic analytic and data integration. ParAccel's software-only approach enables
physical and virtual deployment, either on-premise or in public/private clouds.
RESOURCES FROM OUR SPONSOR




• How to rapidly deploy BI applications across the Enterprise

• Agile BI deployment for Enterprise

• Why simple scalability is the key to Big Data



About SAND:
SAND is the world's most advanced analytic database, managing massive amounts of big
data, driving unparalleled performance, and deploying information to tens of thousands of
concurrent users across the enterprise. With industry-leading software solutions for CRM and
Loyalty, and having achieved "Certified for SAP NetWeaver" status and "Powered by SAP
NetWeaver" status, SAND delivers best-of-breed analytic performance to over 600 customers
around the world. SAND Technology has offices in the United States, Canada, Western and
Central Europe, and Australia and can be reached online at www.sand.com.
RESOURCES FROM OUR SPONSOR




• Leveling the Playing Field: How Companies Use Data for Competitive Advantage

• The Intelligence Future: Simple, Seamless, Social

• SAP HANA: Helping Businesses Run Better - in Real-Time



About SAP:
SAP's vision is for companies of all sizes to become best-run businesses. Best-run businesses
transform rigid value chains into dynamic business networks of customers, partners, and
suppliers. They close the loop between strategy and execution, help individuals work more
productively, and leverage technology for sustainable, profitable growth. This vision is in
keeping with SAP's mission to accelerate business innovation for companies and industries
worldwide - contributing to economic development on a grand scale.
RESOURCES FROM OUR SPONSOR




• The Transition Layer: The Role of Analytical Talent

• Nine Data Prep Lessons for Advanced Analytics

• Operational Analytics: Putting Analytics to work in Operational Systems



About SAS:
SAS is the leader in business analytics software and services, and the largest independent
vendor in the business intelligence market. Through innovative solutions delivered within an
integrated framework, SAS helps customers at more than 50,000 sites improve performance
and deliver value by making better decisions faster. Since 1976 SAS has been giving customers
around the world The Power to Know®.

Big data analytics, research report

  • 1.
    Big Data Analytics: Profiling the Use of Analytical Platforms in User Organizations BY WAYNE ECKERSON Director of Research, Business Applications and Architecture Group, TechTarget, September 2011 BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 1
  • 2.
  • 3.
    EXECUTIVE SUMMARY Executive Summary EXECUTIVE THIS REPORT EXAMINES the rise of “big data” and the use of analytics to mine SUMMARY that data. Companies have been storing and analyzing large volumes of data since the advent of the data warehousing movement in the early 1990s. While RESEARCH terabytes used to be synonymous with big data warehouses, now it’s peta- BACKGROUND bytes, and the rate of growth in data volumes continues to escalate as organi- zations seek to store and analyze greater levels of transaction details, as well as Web- and machine-generated data, to gain a better understanding of cus- WHY BIG DATA? tomer behavior and drivers. BIG DATA I Analytical platforms. To keep pace with the desire to store and analyze ever ANALYTICS: larger volumes of structured data, relational database vendors have delivered DERIVING VALUE FROM BIG DATA specialized analytical platforms that pro- vide dramatically higher levels of price-per- formance compared with general-purpose ARCHITECTURE relational database management systems Companies have been FOR BIG DATA ANALYTICS (RDBMSs). These analytical platforms storing and analyzing come in many shapes and sizes, from soft- ware-only databases and analytical appli- large volumes of data PLATFORMS FOR ances to analytical services that run in a since the advent of RUNNING BIG DATA ANALYTICS third-party hosted environment. Almost the data warehousing three-quarters (72%) of our survey respon- movement in the dents said they have implemented an ana- PROFILING THE USE lytical platform that fits this description. early 1990s. OF ANALYTICAL PLATFORMS In addition, new technologies have emerged to address exploding volumes of complex structured data, including Web RECOMMENDA- traffic, social media content and machine-generated data, such as sensor and TIONS Global Positioning System (GPS) data. 
New nonrelational database vendors combine text indexing and natural language processing techniques with tradi- tional database technology to optimize ad hoc queries against semi-struc- tured data. And many Internet and media companies use new open source BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 3
  • 4.
    EXECUTIVE SUMMARY frameworks such as Hadoop and MapReduce to store and process large vol- umes of structured and unstructured data in batch jobs that run on clusters of commodity servers. I Business users. In the midst of these platform innovations, business users await tools geared to their information requirements. Casual users—execu- EXECUTIVE tives, managers, front-line workers—primarily use reports and dashboards SUMMARY that deliver answers to predefined ques- tions. Power users—business analysts, RESEARCH analytical modelers and data scientists— BACKGROUND perform ad hoc queries against a variety Most business intelligence of sources. Most business intelligence (BI) environments have (BI) environments have done a poor job WHY BIG DATA? meeting these diverse needs within a done a poor job meeting single, unified architecture. But this is these diverse needs BIG DATA changing. within a single, unified ANALYTICS: DERIVING VALUE architecture. But this is FROM BIG DATA I Unified architecture. This report por- trays a unified reporting and analysis changing. environment that finally turns power ARCHITECTURE users into first-class corporate citizens FOR BIG DATA ANALYTICS and makes unstructured data a legiti- mate target for ad hoc and batch queries. The new architecture leverages new analytical technology to stage, store and process large volumes of structured PLATFORMS FOR and unstructured data, turbo-charge sluggish data warehouses and offload RUNNING BIG DATA ANALYTICS complex analytical queries to dedicated data marts. Besides supporting stan- dard reports and dashboards, it creates a series of analytical sandboxes that enable power users to mix personal and corporate data and run complex ana- PROFILING THE USE lytical queries that fuel the modern-day corporation. I OF ANALYTICAL PLATFORMS RECOMMENDA- TIONS BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 4
  • 5.
    RESEARCH BACKGROUND Research Background EXECUTIVE THE PURPOSE OF this report is to profile the use of analytical platforms in user SUMMARY organizations. It is based on a survey of 302 BI professionals as well as inter- views with BI practitioners at user organizations and BI experts at consultan- RESEARCH cies and software companies. BACKGROUND I Survey. The survey consists of 25 pages of questions (approximately 50 questions) with four branches, one for each analytical platform deployment WHY BIG DATA? option: analytical database (software-only), analytical appliance (hardware-software BIG DATA combo), analytical service and file-based ANALYTICS: analytical system (e.g., Hadoop and [This report] is based DERIVING VALUE FROM BIG DATA NoSQL). Respondents who didn’t select an option were passed to a fifth branch where on a survey of 302 they were asked why they hadn’t purchased BI professionals as ARCHITECTURE an analytical platform and whether they well as interviews with FOR BIG DATA ANALYTICS planned to do so. The survey ran from June 22 to August 2, BI practitioners and 2011, and was publicized through several BI experts. PLATFORMS FOR channels. The BI Leadership Forum and RUNNING BIG DATA ANALYTICS BeyeNetwork sent several email broadcasts to their lists. I tweeted about the survey and asked followers to retweet the announcement. Several sponsors, including PROFILING THE USE Teradata, Infobright, and ParAccel, notified their customers about the survey OF ANALYTICAL PLATFORMS through email broadcasts and newsletters. I Respondent profile. Survey respondents are generally IT managers based in RECOMMENDA- North America who work at large companies in a variety of industries (see TIONS Figures 1-4, page 6). I BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 5
  • 6.
    RESEARCH BACKGROUND Figure 1: Which best describes your position in BI? VP/Director Architect Manager Consultant EXECUTIVE SUMMARY Analyst Administrator Developer RESEARCH BACKGROUND Other 0 5 10 15 20 25 30 WHY BIG DATA? Figure 2: Where are you located? Figure 3: What size is your BIG DATA organization by revenues? ANALYTICS: DERIVING VALUE FROM BIG DATA North America . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .66.7% Large ($1B + revenues) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52.4% Europe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16.5% Medium ($50M to $1B) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .24.8% Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16.9% Small (<$50M) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .22.8% ARCHITECTURE FOR BIG DATA ANALYTICS Figure 4: In what industry do you work? PLATFORMS FOR RUNNING BIG DATA Retail ANALYTICS Consulting Banking Insurance PROFILING THE USE OF ANALYTICAL Computers PLATFORMS Telecommunications Software RECOMMENDA- Manufacturing TIONS Health Care Payor Hospitality/Travel Other 0 5 10 15 20 25 30 35 BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 6
  • 7.
    WHY BIG DATA? Why Big Data? EXECUTIVE THERE HAS BEEN a lot of talk about “big data” in the past year, which I find a bit SUMMARY puzzling. I’ve been in the data warehousing field for more than 15 years, and data warehousing has always been about big data. RESEARCH Back in the late 1990s, I attended a ceremony honoring the Terabyte Club, BACKGROUND a handful of companies that were storing more than a terabyte of raw data in their data warehouses. Fast-forward more than 10 years and I could now be attending a ceremony for the Petabyte Club. The trajectory of data acquisition WHY BIG DATA? and storage for reporting and analytical applications has been steadily expanding for the past 15 years. BIG DATA So what’s new in 2011? Why are we ANALYTICS: are talking about big data today? There DERIVING VALUE FROM BIG DATA are several reasons: The growth in data is 1. Changing data types. Organizations fueled by largely unstruc- ARCHITECTURE are capturing different types of data tured data from websites FOR BIG DATA ANALYTICS today. Until about five years ago, most and machine-generated data was transactional in nature, con- sisting of numeric data that fit easily data from an exploding PLATFORMS FOR into rows and columns of relational number of sensors. RUNNING BIG DATA ANALYTICS databases. Today, the growth in data is fueled by largely unstructured data from websites (e.g, Web traffic data PROFILING THE USE and social media content) as well as machine-generated data from an explod- OF ANALYTICAL PLATFORMS ing number of sensors. Most of the new data is actually semi-structured in for- mat, because it consists of headers followed by text strings. Pure unstructured data, such as audio and video data, has limited textual content and is more dif- RECOMMENDA- ficult to parse and analyze, but it is also growing (see Figure 5, page 8). TIONS 2. Technology advances. Hardware has finally caught up with software. 
The exponential gains in price-performance exhibited by computer processors, memory and disk storage have finally made it possible to store and analyze BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 7
  • 8.
    WHY BIG DATA? large volumes of data at an affordable price. Database vendors have exploited these advances by developing new high-speed analytical platforms designed to accelerate query performance against large volumes of data, while the open source com- munity has developed Hadoop, a distributed file management system designed to capture, Organizations are EXECUTIVE store and analyze large volumes of Web log SUMMARY storing and analyzing data, among other things. In other words, organizations are storing and analyzing more more data because RESEARCH data because they can. they can. BACKGROUND 3. Insourcing and outsourcing. Because of the complexity and cost of storing and analyzing WHY BIG DATA? Web traffic data, most organizations traditionally outsourced these functions to third-party service bureaus like Omniture. But as the size and importance BIG DATA of corporate e-commerce channels have increased, many are now eager to ANALYTICS: insource this data to gain greater insights about customers. For example, DERIVING VALUE FROM BIG DATA ARCHITECTURE Figure 5: Data growth FOR BIG DATA ANALYTICS PLATFORMS FOR RUNNING BIG DATA I Unstructured and content depot ANALYTICS I Structured and replicated PROFILING THE USE OF ANALYTICAL PLATFORMS RECOMMENDA- TIONS 2005 2006 2007 2008 2009 2010 2011 2012 SOURCE: IDC DIGITAL UNIVERSE 2009: WHITE PAPER, SPONSORED BY EMC, 2009. BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 8
  • 9.
    WHY BIG DATA? automobile valuation company Kelley Blue Book is now collecting and storing Web traffic data in-house so it can combine that information with sales and other corporate data to better understand customer behavior, according to Dan Ingle, vice president of analytical insights and technology at the company. At the same time, virtualization tech- nology is beginning to make it attractive EXECUTIVE for organizations to consider moving SUMMARY large-scale data processing outside ”We are the beginning their data center walls to private hosted of an amazing world of RESEARCH networks or public clouds. BACKGROUND data driven applications. 4. Developers discover data. The It's up to us to shape biggest reason for the popularity of the the world.” WHY BIG DATA? term big data is that Web and applica- tion developers have discovered the —TIM O’REILLY, founder, O'Reilly Media BIG DATA value of building new data-intensive ANALYTICS: applications. To application developers, DERIVING VALUE FROM BIG DATA big data is new and exciting. Tim O’Reilly, founder of O’Reilly Media, a longtime high-tech luminary and open source proponent, speaking at Hadoop ARCHITECTURE World in New York in November 2010, said: "We are the beginning of an FOR BIG DATA ANALYTICS amazing world of data-driven applications. It's up to us to shape the world." Of course, for those of us who have made their careers in the data world, the new era of “big data” is simply another step in the evolution of data management PLATFORMS FOR systems that support reporting and analysis applications. I RUNNING BIG DATA ANALYTICS PROFILING THE USE OF ANALYTICAL PLATFORMS RECOMMENDA- TIONS BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 9
  • 10.
    BIG DATA ANALYTICS:DERIVING VALUE FROM BIG DATA Big Data Analytics: Deriving Value from Big Data EXECUTIVE SUMMARY BIG DATA BY itself, regardless of the type, is worthless unless business users RESEARCH do something with it that delivers value to their organizations. That’s where BACKGROUND analytics comes in. Although organizations have always run reports against data warehouses, most haven’t opened these repositories to ad hoc explo- ration. This is partly because analysis tools are too complex for the average WHY BIG DATA? user but also because the repositories often don’t contain all the data needed by the power user. But this is changing. BIG DATA ANALYTICS: I Big vs. small data. A valuable characteristic of “big” data is that it contains DERIVING VALUE FROM BIG DATA more patterns and interesting anomalies than “small” data. Thus, organiza- tions can gain greater value by mining large data volumes than small ones. While users can detect the patterns in small data sets using simple statistical ARCHITECTURE methods, ad hoc query and analysis tools or by eyeballing the data, they need FOR BIG DATA ANALYTICS sophisticated techniques to mine big data. Fortunately, these techniques and tools already exist thanks to companies such as SAS Institute and SPSS (now part of IBM) that ship analytical workbenches (i.e., data mining tools). PLATFORMS FOR These tools incorporate all kinds of analytical algorithms that have been de- RUNNING BIG DATA ANALYTICS veloped and refined by academic and commercial researchers over the past 40 years. PROFILING THE USE I Real-time data. Organizations that accumulate big data recognize quickly OF ANALYTICAL PLATFORMS that they need to change the way they capture, transform and move data from a nightly batch process to a continuous process using micro batch loads or event-driven updates. 
This technical constraint pays big business dividends RECOMMENDA- because it makes it possible to deliver critical information to users in near real TIONS time. In other words, big data fosters operational analytics by supporting just- in-time information delivery. The market today is witnessing a perfect storm with the convergence of big data, deep analytics and real-time information delivery. BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 10
  • 11.
    BIG DATA ANALYTICS:DERIVING VALUE FROM BIG DATA I Complex analytics. In addition, during the past 15 years, the “analytical IQ” of many organizations has evolved from reporting and dashboarding to light- weight analysis conducted with query and online analytical processing (OLAP) tools. Many organizations are now on the verge of upping their analytical IQ by implementing complex analytics against both structured and unstruc- EXECUTIVE tured data. Complex analytics spans a SUMMARY vast array of techniques and applica- tions. Traditional analytical workbenches Analytics increases RESEARCH from SAS and SPSS create mathematical corporate intelligence. … BACKGROUND models of historical data that can be and is the only true used to predict future behavior. This type source of sustainable of predictive analytics can be used to do WHY BIG DATA? everything from delivering highly tailored advantage. cross-sell recommendations to predict- BIG DATA ing failure rates of aircraft engines. In ANALYTICS: addition, organizations are now applying DERIVING VALUE FROM BIG DATA a variety of complex analytics to Web, social media and other forms of com- plex structured data that are hard to do with traditional SQL-based tools, including path analysis, graph analysis, link analysis, fuzzy matching and ARCHITECTURE so on. FOR BIG DATA ANALYTICS Organizations are now recruiting analysts who know how to wield these analytical tools to unearth the hidden value in big data. They are hiring analyti- cal modelers who know how to use data mining workbenches, as well as data PLATFORMS FOR scientists, application developers with process and data knowledge who write RUNNING BIG DATA ANALYTICS programming code to run against large Hadoop clusters. I Sustainable advantage. 
At the same time, executives have recognized the PROFILING THE USE power of analytics to deliver a competitive advantage, thanks to the pioneer- OF ANALYTICAL PLATFORMS ing work of thought leaders such as Tom Davenport, who co-wrote the book Competing on Analytics: The New Science of Winning. In fact, forward-thinking executives recognize that analytics may be the only true source of sustainable RECOMMENDA- advantage since it empowers employees at all levels of an organization with TIONS information to help them make smarter decisions. In essence, analytics increases corporate intelligence, which is something you can never package or systematize and competitors can’t duplicate. In short, many organizations have laid the groundwork to reap the benefits of “big data analytics.” BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS 11
  • 12.
    A FRAMEWORK FOR SUCCESS

    However, the road to big data analytics is not easy and success is not guaranteed. Analytical champions are still rare. That’s because succeeding with big data analytics requires the right culture, people, organization, architecture and technology (see Figure 6).

    Figure 6: Big data analytics framework

    1. The right culture. Analytical organizations are championed by executives who believe in making fact-based decisions or validating intuition with data. These executives create a culture of performance measurement in which individuals and groups are held accountable for the outcomes of predefined metrics aligned with strategic objectives. These leaders recruit other executives who believe in the power of data and are willing to invest money and their own time to create a learning organization that runs by the numbers and uses analytical techniques to exploit big data.

    2. The right people. You can’t do big data analytics without power users, or
  • 13.
    more specifically, business analysts, analytical modelers and data scientists. These folks possess a rare combination of skills and knowledge: They have a deep understanding of business processes and the data that sits behind those processes and are skillful in the use of various analytical tools, including Excel, SQL, analytical workbenches and coding languages. They are highly motivated, critical thinkers who command an above-average salary, exhibit a passion for success and deliver outsized value to the organization.

    3. The right organization. Historically, analysts with the aforementioned skills were hired by department heads and pooled in pockets of the organization. But analytical champions create a shared service organization (i.e., an analytical center of excellence) that makes analytics a pervasive competence. Analysts are still assigned to specific departments and processes, but they are also part of a central organization that provides collaboration, camaraderie and a career path for analysts. At the same time, the director of the central organization maintains a close relationship with the data warehousing team (or owns the function outright) to ensure that business analysts have open access to the data they need to do their jobs. Data is fuel for a business analyst or data scientist.

    4. The right architecture. The data warehousing team plays a critical role in delivering deep analytics. It needs to establish an architecture that ensures the delivery of high-quality, secure, consistent information while providing open access to those who need it. Threading this needle takes wisdom, a good deal of political astuteness and a BI-savvy data architecture team.
    The architecture itself must be able to consume large volumes of structured and unstructured data and make it available to different classes of users via a variety of tools (see “Architecture for Big Data Analytics” below).

    5. Analytical platform. At the heart of an analytical infrastructure is an analytical platform, the underlying data management system that consumes, integrates and provides user access to information for reporting and analysis activities. Today, many vendors, including most of the sponsors of this report, offer specialized analytical platforms that deliver dramatically better query performance than existing systems. There are many types of analytical platforms sold by dozens of vendors (see “Types of Analytical Platforms” below).
  • 14.
    Architecture for Big Data Analytics

    If big data is simply a continuation of longstanding data trends, does it change the way organizations architect and deploy data warehousing environments? Big data analytics doesn’t change data warehousing or BI architectures; it simply supplements them with new technologies and access methods better tailored to meeting the information requirements of business analysts and data scientists.

    Top-down. For the past 15 years, BI teams have built data warehouses that serve the information needs of casual users (e.g., executives, managers and front-line staff). These top-down, report-driven environments require developers to know in advance what kinds of questions casual users want to ask and which metrics they want to monitor. With requirements in hand, developers create a data warehouse model, build extract, transform and load (ETL) routines to move data from source systems to the data warehouse, and then create reports and dashboards to query the data warehouse (see Figure 7, page 15).

    Whether by choice or not, power users who operate in an exclusively top-down BI environment are largely left to fend for themselves, using spreadsheets, desktop databases, SQL and data-mining workbenches. Business analysts generally find BI tools too inflexible and data warehouse data too limited. At best, they might use BI tools as glorified extract engines to dump data into Microsoft Excel, Access or some other analytical environment.
    The upshot is that these analysts and data scientists generally spend an inordinate amount of time preparing data instead of analyzing it and create hundreds if not thousands of data silos that wreak havoc on information consistency from a corporate perspective.

    Bottom-up. Business analysts and data scientists need a different type of analytical environment, one that caters to their needs. This is a bottom-up environment that fosters ad hoc exploration of any data source, both inside and outside corporate boundaries, and minimizes the need for analysts to create data silos. Here, business analysts don’t know what questions they need to
  • 15.
    answer in advance because they are usually responding to emergency requests from executives and managers who need information to address new and unanticipated events in the marketplace. Rather than focus on goals and metrics, business analysts spend most of their time engaged in ad hoc projects, or they work closely with business managers to optimize existing processes.

    As you can see, there is a world of difference between a top-down and bottom-up BI environment. Many organizations have tried to support both types of processing within a single BI environment. But that no longer works in the age of big data analytics. Forward-thinking companies are expanding their data warehousing architectures and data governance programs to better balance the dynamic between top-down and bottom-up requirements. (See Analytic Architectures: Approaches to Supporting Analytics Users and Workloads, a 40-page report by Wayne Eckerson, available for free download.)

    Figure 7: Top-down vs. bottom-up BI
    Top-down and bottom-up BI environments are distinct but complementary, yet most organizations try to shoehorn both into a single architecture.
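The top-down flow described in this section (design a model, build ETL routines, then create reports) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the report's method: it uses Python's built-in sqlite3 module as a stand-in for a data warehouse, and the table name `fact_sales`, the sample rows and the helper names `run_etl` and `sales_by_store` are hypothetical.

```python
import sqlite3

# Hypothetical source-system extract: (order_id, store, amount_cents, order_date)
source_rows = [
    (1, " Boston ", 1250, "2011-08-01"),
    (2, "boston", 900, "2011-08-01"),
    (3, "Denver", 4000, "2011-08-02"),
]


def run_etl(rows):
    """Load cleaned source rows into an in-memory stand-in for a warehouse."""
    dw = sqlite3.connect(":memory:")
    # 1. Model: the table is designed up front around known reporting needs.
    dw.execute(
        "CREATE TABLE fact_sales ("
        "order_id INTEGER PRIMARY KEY, store TEXT, amount REAL, order_date TEXT)"
    )
    # 2. Transform: standardize store names and convert cents to dollars.
    cleaned = [
        (order_id, store.strip().title(), cents / 100.0, day)
        for order_id, store, cents, day in rows
    ]
    # 3. Load the cleaned rows into the warehouse table.
    dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", cleaned)
    dw.commit()
    return dw


def sales_by_store(dw):
    """4. Report: a predefined query of the kind that sits behind a dashboard."""
    return dw.execute(
        "SELECT store, SUM(amount) FROM fact_sales GROUP BY store ORDER BY store"
    ).fetchall()


dw = run_etl(source_rows)
print(sales_by_store(dw))
```

Every step depends on knowing the questions in advance: a new question that needs a column missing from `fact_sales` means changing the model and the ETL routine first, which is the inflexibility that pushes power users toward their own data silos.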
  • 16.
    NEXT-GENERATION BI ARCHITECTURE

    Figure 8 represents the next-generation BI architecture, which blends elements from top-down and bottom-up BI into a single cohesive environment that adequately supports both casual and power users. The top half of the diagram represents the classic top-down, data warehousing architecture that primarily delivers interactive reports and dashboards to casual users (although the streaming/complex event processing (CEP) engine is new). The bottom half of the diagram adds new architectural elements and data sources that better accommodate the needs of business analysts and data scientists and make them full-fledged members of the corporate data environment.

    Figure 8: The new BI architecture
    The next-generation BI architecture is more analytical, giving power users greater options to access and mix corporate data with their own data via various types of analytical sandboxes. It also brings unstructured and semi-structured data fully into the mix using Hadoop and nonrelational databases.
  • 17.
    SERVER ENVIRONMENT

    Hadoop. The biggest change in the new BI architecture is that the data warehouse is no longer the centerpiece. It now shares the spotlight with systems that manage structured and unstructured data. The most popular among these is Hadoop, an open source software framework for building data-intensive applications. Following the example of Internet pioneers, such as Google, Amazon and Yahoo, many companies now use Hadoop to store, manage and process large volumes of Web data. Hadoop runs on the Hadoop Distributed File System (HDFS), a distributed file system that scales out on commodity servers. Since Hadoop is file-based, developers don’t need to create a data model to store or process data, which makes Hadoop ideal for managing semi-structured Web data, which comes in many shapes and sizes. Because it is “schema-less,” Hadoop can also be used to store and process any kind of data, including structured transactional data and unstructured audio and video data. The biggest advantage of Hadoop right now, however, is that it’s open source, which means that the up-front costs of implementing a system to process large volumes of data are lower than for commercial systems. That said, Hadoop does require companies to purchase and manage dozens, if not hundreds, of servers and train developers and administrators to use this new technology.

    Data warehouse integration. Today some companies use Hadoop as a staging area for unstructured and semi-structured data (e.g., Web traffic) before loading it into a data warehouse.
    These companies keep the atomic data in Hadoop and push lightly summarized data sets to the data warehouse or nonrelational systems for reporting and analysis. However, some companies let power users with appropriate skills query raw data in Hadoop.

    For example, LiveRail, an online video advertising service provider, follows the lead of most Internet providers and uses both Hadoop and a data warehouse to support a range of analytical needs. LiveRail keeps its raw Web data from video campaigns in Hadoop and summarized data about those campaigns in Infobright, a commercial, open source columnar database. Business
  • 18.
    users, including LiveRail’s customers who want to check on the performance of their campaigns, run ad hoc queries and reports against Infobright, while developers who need to access the raw data use Hive to schedule and run reports against Hadoop.

    “Since Hadoop isn’t interactive, we needed a fast database that scales to support our ad hoc environment, which is why we chose Infobright,” said Andrei Dunca, chief technology officer at LiveRail.

    Use cases for Hadoop. According to a new report by Ventana Research titled Hadoop and Information Management: Benchmarking the Challenge of Enormous Volumes of Data, Hadoop is more likely to be used than traditional data management systems for three purposes: to “perform types of analytics that couldn’t be done on large volumes of data before,” “capture all the source data we are collecting (pre-process)” and “keep more historical data (post-process).” (See Figure 9.)

    Figure 9: Role of Hadoop
    [Bar chart with Hadoop and non-Hadoop series on an axis running from 0 to 100, covering four items: analyze data at a greater level of detail; perform types of analytics that couldn’t be done on large volumes of data before; keep more historical data (post-process); capture all of the source data that we are collecting (pre-process).]
    SOURCE: HADOOP AND INFORMATION MANAGEMENT: BENCHMARKING THE CHALLENGE OF ENORMOUS VOLUMES OF DATA: EXECUTIVE SUMMARY, VENTANA RESEARCH, JUNE 23, 2011.

    Once data lands in Hadoop, whether it’s Web data or not, organizations have several options:
  • 19.
      • Create an online archive. With Hadoop, organizations don’t have to delete or ship the data to offline storage; they can keep it online indefinitely by adding commodity servers to meet storage and processing requirements. Hadoop becomes a low-cost alternative for meeting online archival requirements.

      • Feed the data warehouse. Organizations can also use Hadoop to parse, integrate and aggregate large volumes of Web or other types of data and then ship it to the data warehouse, where both casual and power users can query and analyze the data using familiar BI tools. Here, Hadoop becomes an ETL tool for processing large volumes of Web data before it lands in the corporate data warehouse.

      • Support analytics. The big data crowd (i.e., Internet developers) views Hadoop primarily as an analytical engine for running analytical computations against large volumes of data. To query Hadoop, analysts currently need to write programs in Java or other languages and understand MapReduce, a framework for writing distributed (or parallel) applications. The advantage here is that analysts aren’t restricted by SQL when formulating queries. SQL does not support many types of analytics, especially those that involve inter-row calculations, which are common in Web traffic analysis. The disadvantage is that Hadoop is batch-oriented and not conducive to iterative querying.

      • Run reports. Hadoop’s batch-orientation, however, makes it suitable for executing regularly scheduled reports. Rather than running reports against summary data, organizations can now run them against raw data, guaranteeing the most accurate results.

    Nonrelational databases.
    While Hadoop has received a lot of press attention lately, it’s not the only game in town for storing and managing semi-structured data. In fact, an emerging and diverse set of products goes one step further than Hadoop and stores both structured and unstructured data within a single index.
  • 20.
    These so-called nonrelational databases (depicted in Figure 8 supporting a “free-standing sandbox”) typically extract entities from documents, files and other databases using natural language processing techniques and index them as key-value pairs for quick retrieval using a document-centric query language such as XQuery. As a result, these products can give users one place to go to query both structured and unstructured data.

    This style of analysis, which some call “unified information access,” exhibits many search-like characteristics. But instead of returning a list of links, the systems return qualified data sets or reports in response to user queries. And unlike Hadoop, the systems are interactive, allowing users to submit queries in an iterative fashion so they can understand trends and issues.

    These nonrelational systems complement Hadoop, an enterprise data warehouse or both. For example, organizations might use Hadoop to transcribe audio files and then load the transcriptions into a nonrelational database for analysis. Or they might replicate sales and customer data from a data warehouse and combine it with Web data in a nonrelational database so power users can find correlations between Web traffic and customer orders without bogging down performance of the data warehouse with complex queries. This type of unified information access is critical in a growing number of applications.

    For example, an oil and gas company uses MarkLogic to track the location of ships at sea. The MarkLogic Server stores GPS data, news feeds, weather data and commodity prices, among other things, and surfaces all this data on a map that users can query. For example, a user might ask, “Show me all the ships within this polygon (i.e., geographic area) that are carrying this type of oil and have changed course since leaving the port of origin.” The application then displays the results on the map.

    Data warehouse hubs. While Hadoop and nonrelational systems primarily manage semi-structured and unstructured data, the data warehouse manages structured data from run-the-business operational systems. Except for Teradata shops, many companies increasingly use data warehouses running on traditional relational databases as hubs to feed other systems and applications rather than to host reporting and analysis applications.
  • 21.
    For example, Dow Chemical, which maintains a large SAP Business Warehouse (BW) data warehouse, now runs all queries against virtual cubes that run in memory using SAP BW Accelerator. “By running our cubes in memory, we’ve de-bottlenecked our data warehouse,” said Mike Masciandaro, director of business intelligence at Dow. “Now our data warehouse primarily manages batch loads to stage data.”

    Likewise, Blue Cross Blue Shield of Kansas City has transformed its IBM DB2 data warehouse into a hub that feeds transaction and analytical systems and implemented a Teradata Data Warehouse Appliance 2650 to handle all reports and queries and support a self-service BI environment.

    Analytical sandboxes. In keeping with its role as a hub, an enterprise data warehouse at many organizations now distributes data to analytical sandboxes that are designed to wean business analysts and data scientists off data shadow systems and make them full-fledged consumers of the corporate data infrastructure. There are four types of analytical sandboxes:

      • Hadoop. Hadoop can be considered an analytical sandbox for Web data that developers with appropriate skills can access to run complex queries and calculations. Rather than analyze summarized or transformed data in a data warehouse, developers can run calculations and models against the raw, atomic data.

      • Virtual DW sandbox. A virtual data warehouse sandbox is a partition, or set of tables, inside the data warehouse, dedicated to individual analysts.
        Rather than create a spreadmart, analysts upload their data into a partition and combine it with data from the data warehouse that is either “pushed” to the partition by the BI team using ETL processes or “pulled” by analysts through queries. The BI team carefully allocates compute resources so analysts have enough horsepower to run ad hoc and complex queries without interfering with other workloads running on the data warehouse.
  • 22.
      • Free-standing sandbox. A free-standing sandbox is a separate system from the data warehouse with its own server, storage and database that is designed to support complex, analytical queries. In some cases, it is an analytical platform that runs complex queries against a replica of the data warehouse. In other cases, it runs on a nonrelational database and contains an entirely new set of data (e.g., Web logs or sensor data) that either doesn’t fit in the data warehouse because of space constraints or is processed more efficiently in a nontraditional platform. Occasionally, it mixes both corporate and departmental data into a local data mart that can run on-premises or off-site in a hosted environment. In all cases, it provides a dedicated environment for a targeted group of analytically minded users.

      • In-memory BI sandbox. Some desktop BI tools, such as QlikView or PowerPivot, maintain a local data store in memory to support interactive dashboards or ad hoc queries. These sandboxes are popular among analysts, because they generally let them pull data from any source, quickly link data sets, run super-fast queries against data held in memory, and visually interact with the results, all without much or any IT intervention. Also, some server-based environments, such as SAP HANA, store all data in memory, accelerating queries for all types of BI users.

    Streaming/CEP engine. The top-down environment picks up one important new architectural feature: streaming and CEP engines.
    Built to support continuous intelligence, CEP engines ingest large volumes of discrete events in real time, calculate or correlate those events, enrich them with historical data if needed, and apply rules that notify users when specific types of activity or anomalies occur. For example, these engines are ideal for detecting fraud in a stream of thousands of transactions per second.

    These rules-driven systems are like intelligent sensors that organizations can attach to streams of transaction data to watch for meaningful combinations of events or trends. In essence, CEP systems are sophisticated notification systems designed to monitor real-time events. They are ideal for monitoring continuous operations, such as supply chains, transportation operations, factory floors, casinos, hospital emergency rooms, Web-based gaming systems and customer contact centers.

    Streaming engines are similar to CEP engines but are designed to handle
  • 23.
    enormous volumes of a single discrete event type, such as sensor data generated by a pipeline or medical device. Streaming engines typically ingest an order of magnitude more events per second than CEP engines but typically only pull data from a single source. However, streaming and CEP engines are merging in functionality as vendors seek to offer one-stop shopping for continuous intelligence capabilities.

    CLIENT ENVIRONMENT

    Casual users. The front end of the new BI architecture remains relatively unchanged for casual users, who continue to use reports and dashboards running against dependent data marts (either logical or physical) fed by a data warehouse. This environment typically meets 60% to 80% of their information needs, which can be defined up-front through requirements-gathering exercises. Predefined reports and dashboards are designed to answer questions tailored to individual roles within the organization.

    However, meeting the ad hoc needs of casual users continues to be a problem. Interactive reports and dashboards help to some degree, but casual users today still rely on the IT department or “super users” (tech-savvy business colleagues) to create ad hoc reports and views on their behalf. Search-based exploration tools that allow users to type queries in plain English and refine their search using facets or categories offer significant promise but are not yet mainstream technology.
    One new addition to the casual user environment is dashboards powered by streaming/CEP engines. While these operational dashboards are primarily used by operational analysts and workers, many executives and managers are keen to keep their fingers on the pulse of their companies’ core processes by accessing these “twinkling” dashboards directly or, more commonly, receiving alerts from these systems inside existing BI environments.
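The alerts mentioned above come from rules that a CEP engine applies to a stream of events. The following is a minimal sketch of one such rule, assuming a hypothetical fraud check: raise an alert when a single card produces more than a set number of transactions inside a time window. The class name `VelocityRule`, the thresholds and the sample events are illustrative assumptions, and a real CEP product would add event correlation, enrichment with historical data and far higher throughput.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Alert when one card exceeds max_events transactions in window_seconds."""

    def __init__(self, max_events=3, window_seconds=60):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # card_id -> timestamps of recent events still inside the window
        self.recent = defaultdict(deque)

    def on_event(self, card_id, timestamp):
        """Process one event; return an alert string or None."""
        window = self.recent[card_id]
        window.append(timestamp)
        # Evict events that have fallen out of the time window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_events:
            return (
                f"ALERT card={card_id} count={len(window)} "
                f"within {self.window_seconds}s"
            )
        return None


# Hypothetical stream of (card_id, timestamp in seconds), assumed time-ordered.
events = [("A", 0), ("B", 5), ("A", 10), ("A", 20), ("A", 30), ("A", 200)]

rule = VelocityRule(max_events=3, window_seconds=60)
alerts = []
for card_id, timestamp in events:
    alert = rule.on_event(card_id, timestamp)
    if alert:
        alerts.append(alert)

print(alerts)
```

Only the fourth transaction on card A within 60 seconds triggers an alert; the much later transaction at 200 seconds starts a fresh window. Keeping per-key state in memory and evaluating each event as it arrives is what distinguishes this style from batch querying.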
  • 24.
    Power users. The biggest change in the new analytical BI architecture is how it accommodates the information needs of power users. It gives power users many new options for consuming corporate data rather than creating countless spreadmarts. A power user is a person whose job is to crunch data on a daily basis to generate insights and plans. Power users include business analysts (e.g., Excel jockeys), analytical modelers (e.g., SAS programmers and statisticians) and data scientists (e.g., application developers with business process and database expertise). Power users have five options as depicted in Figure 8:

      • Query a virtual sandbox. Rather than spend valuable time creating spreadmarts, power users can leverage the processing power and data of the data warehouse by using a virtual sandbox. Here, they can upload their own data to the data warehouse, mix it with corporate data and perform their analyses. However, if they want to share something they’ve built in their sandboxes, they need to hand over their analyses to the BI team to turn them into production applications.

      • Query a free-standing sandbox. Power users can also query a free-standing sandbox created for their benefit. In most cases, the system is tuned to support ad hoc queries and analytical modeling activities against a replica of the data warehouse or another data set designed for power users.

      • Query a BI sandbox. Power users can download data from a data warehouse or other source into a local BI tool and interact with the data in memory at the speed of thought.
        While these sandboxes have the potential to become spreadmarts, new analytical tools usually bake in server environments that encourage, if not require, power users to publish their analyses to an IT-controlled environment. Many are also starting to offer collaboration capabilities that encourage reuse and minimize the proliferation of data silos.

      • Query the data warehouse. Some BI teams give permission to a handful of trusted power users to directly query the data warehouse or DW staging area. This requires that analysts have a deep understanding of the raw data and advanced knowledge of SQL to avoid creating runaway queries or generating incorrect results.
  • 25.
      • Query Hadoop. If power users want to analyze big data in its raw or lightly aggregated form, they can query Hadoop directly by writing MapReduce code in a variety of languages. However, power users must know how to write parallel queries and interrogate the structure of the data prior to querying it since Hadoop data is schema-less. Vendors are also beginning to ship BI tools that access Hadoop through Hive or HBase and return data sets to the BI tool.

    Data integration. The new BI architecture also places a premium on managing and manipulating data flows between systems. This calls for a versatile set of data integration tools that can access any type of data (e.g., structured, semi-structured and unstructured), load it into any target (e.g., Hadoop, data warehouse or in-memory database), navigate data sources that exist on-premises or in the cloud, work in batch and real time, and handle both small and large volumes of data.

    Data integration tools for Hadoop are in their infancy but evolving fast. The open source community has developed Flume, a scalable distributed service that collects, aggregates and loads data into HDFS. But longtime data integration vendors, such as Informatica, are also converting their visual design tools to interoperate with Hadoop. That way, ETL developers can use familiar tools to extract, load, parse, integrate, cleanse and match data in Hadoop by generating MapReduce code under the covers.
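To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) in plain Python. It does not use Hadoop's API and is not code from the report: the log format (IP address, URL, bytes served) and the function names `map_line`, `shuffle`, `reduce_url` and `run_job` are assumptions for illustration. It also shows what "schema-less" means in practice, since the structure is imposed when each raw line is read rather than when it is stored.

```python
from collections import defaultdict


def map_line(line):
    """Map phase: parse one raw log line and emit (url, bytes) pairs.

    Assumed line format: "<ip> <url> <bytes>". The schema is applied here,
    at read time, so malformed lines are skipped rather than failing the job.
    """
    parts = line.split()
    if len(parts) != 3:
        return
    _ip, url, size = parts
    if size.isdigit():
        yield url, int(size)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_url(url, sizes):
    """Reduce phase: summarize one key as (url, hits, total bytes)."""
    return url, len(sizes), sum(sizes)


def run_job(lines):
    """Run map, shuffle and reduce in one process and return sorted results."""
    pairs = (pair for line in lines for pair in map_line(line))
    return sorted(reduce_url(url, sizes) for url, sizes in shuffle(pairs).items())


# Hypothetical raw Web log lines, including one that does not parse.
raw_log = [
    "10.0.0.1 /home 512",
    "10.0.0.2 /video 20480",
    "garbled line",
    "10.0.0.1 /video 10240",
]

print(run_job(raw_log))
```

In a real cluster the map and reduce functions run in parallel on many servers and the framework performs the shuffle across the network, but the division of labor is the same one that ETL tools reproduce when they generate MapReduce code.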
    In this respect, Hadoop is both another data source for ETL tools and a new data processing engine geared to handling semi-structured and unstructured data. Data integration products that run on both relational and nonrelational platforms and maintain a consistent set of metadata across both environments will reduce overall training and maintenance costs.

    Analytical services. Although it doesn’t happen often, a growing number of
  • 26.
    companies are outsourcing analytical applications, and occasionally an entire data warehouse, to a third-party provider. The most popular applications to outsource are test, development and prototyping applications. Often, IT administrators will provision servers on demand from a public cloud to support these types of applications. However, if an organization wants a permanent online presence to support an analytical sandbox or data warehouse, it will subscribe to a private hosted service, which provides higher levels of guaranteed availability and performance compared with a public cloud.

    In either case, the motivation to implement an analytical service is straightforward: An analytical service requires minimal IT involvement and up-front capital, making it easy and painless to get up and running quickly. Ironically, few companies think of outsourcing their analytical environments when exploring options.

    Dollar General. That was the case with Dollar General, a discount retailer that wanted to purchase an analytical system to supplement its Oracle data warehouse, which could not store atomic-level point-of-sale (POS) data from its 9,500 stores nationwide. With a reference from a consumer products partner, Dollar General decided to implement an analytical platform from a services provider, 1010data. The product, which is accessed via a Web browser, offers an Excel-like interface that provides native support for time-series data and analytical functions. Within five weeks, Dollar General was running daily reports against atomic-level POS data, according to Sandy Steier, executive vice president and co-founder of 1010data.
    A year later, Dollar General decided to replace its Oracle data warehouse and conducted a proof of concept with several leading analytical platform providers. 1010data, which participated in the bake-off, demonstrated superior performance and now runs Dollar General’s entire data warehouse.

    And while Dollar General didn’t set out to purchase an analytical service, that proved a smart move. Besides quick deployment times and reduced internal maintenance costs, the analytical service made it easier for Dollar General to open up its data warehouse to suppliers, which now use it to track sales and make recommendations for product placement and promotions, Steier said.
  • 27.
    Platforms for Running Big Data Analytics

    SINCE THE BEGINNING of the data warehousing movement in the early 1990s, organizations have used general-purpose data management systems to implement data warehouses and, occasionally, multidimensional databases (i.e., “cubes”) to support subject-specific data marts, especially for financial analytics. General-purpose data management systems were designed for transaction processing (i.e., rapid, secure, synchronized updates against small data sets) and only later modified to handle analytical processing (i.e., complex queries against large data sets). In contrast, analytical platforms focus entirely on analytical processing at the expense of transaction processing. [1]

    The analytical platform movement. In 2002, Netezza (now owned by IBM) introduced a specialized analytical appliance, a tightly integrated hardware-software database management system designed explicitly to run ad hoc queries at blindingly fast speeds. Netezza’s success spawned a host of competitors, and there are now more than two dozen players in the market. The value of this new analytical technology didn’t escape the notice of the world’s biggest software vendors, each of which has made a major investment in the space, either through organic development or an acquisition (see Table 1).

    To be accurate, Netezza wasn’t the first mover in the market, but it came along at the right time. In the mid-2000s, many corporate BI teams were

    [1] As with most rules, there are exceptions. For example, Oracle Exadata runs on Oracle 10g and, as such, it supports both transactional and analytical processing, often with superior performance in both realms compared with standard Oracle 10g installations.
  • 28.
    Table 1: Types of analytical platforms
    (Companies in parentheses recently acquired the preceding product or company)

    TECHNOLOGY: Massively parallel processing analytic databases
    DESCRIPTION: Row-based databases designed to scale out on a cluster of commodity servers and run complex queries in parallel against large volumes of data.
    VENDOR/PRODUCT: Teradata Active Data Warehouse, Greenplum (EMC), Microsoft Parallel Data Warehouse, Aster Data (Teradata), Kognitio, Dataupia

    TECHNOLOGY: Columnar databases
    DESCRIPTION: Database management systems that store data in columns, not rows, and support high data compression ratios.
    VENDOR/PRODUCT: ParAccel, Infobright, Sand Technology, Sybase IQ (SAP), Vertica (Hewlett-Packard), 1010data, Exasol, Calpont

    TECHNOLOGY: Analytical appliances
    DESCRIPTION: Preconfigured hardware-software systems designed for query processing and analytics that require little tuning.
    VENDOR/PRODUCT: Netezza (IBM), Teradata Appliances, Oracle Exadata, Greenplum Data Computing Appliance (EMC)

    TECHNOLOGY: Analytical bundles
    DESCRIPTION: Predefined hardware and software configurations that are certified to meet specific performance criteria, but customers must purchase and configure themselves.
    VENDOR/PRODUCT: IBM SmartAnalytics, Microsoft FastTrack

    TECHNOLOGY: In-memory databases
    DESCRIPTION: Systems that load data into memory to execute complex queries.
    VENDOR/PRODUCT: SAP HANA, Cognos TM1 (IBM), QlikView, Membase

    TECHNOLOGY: Distributed file-based systems
    DESCRIPTION: Distributed file systems designed for storing, indexing, manipulating and querying large volumes of unstructured and semi-structured data.
    VENDOR/PRODUCT: Hadoop (Apache, Cloudera, MapR, IBM, Hortonworks), Apache Hive, Apache Pig

    TECHNOLOGY: Analytical services
    DESCRIPTION: Analytical platforms delivered as hosted or public-cloud-based services.
    VENDOR/PRODUCT: 1010data, Kognitio

    TECHNOLOGY: Nonrelational
    DESCRIPTION: Nonrelational databases optimized for querying unstructured data as well as structured data.
    VENDOR/PRODUCT: MarkLogic Server, MongoDB, Splunk, Attivio, Endeca, Apache Cassandra, Apache HBase

    TECHNOLOGY: CEP/streaming engines
    DESCRIPTION: Ingest, filter, calculate and correlate large volumes of discrete events and apply rules that trigger alerts when conditions are met.
    VENDOR/PRODUCT: IBM, Tibco, Streambase, Sybase (Aleri), Opalma, Vitria, Informatica
  • 29.
    struggling to deliver reasonable query performance using general-purpose data management systems as the volumes of data and numbers of users and applications running against their data warehouses exploded. Netezza offered a convenient way to offload long-running or complex queries from data warehouses and satisfy the needs of business analysts. Yet long before Netezza shipped its first box, Teradata delivered the first massively parallel database management system geared to analytical processing in the 1980s, and Sybase shipped the first columnar database in the 1990s. Both of these products now have thousands of customers and, in this respect, can be considered front-runners in the so-called analytical platform market.

    Today, the technology behind analytical platforms is diverse: appliances, columnar databases, in-memory databases, massively parallel processing (MPP) databases, file-based systems, nonrelational databases and analytical services. What they all have in common, however, is that they provide significant improvements in price-performance, availability, load times and manageability compared with general-purpose relational database management systems. Every analytical platform customer I’ve interviewed has cited order-of-magnitude performance gains that most initially don’t believe. “The performance is blinding; just amazing,” says Masciandaro of Dow.

    Analytical techniques. Analytical platforms offer superior price-performance for many reasons. And while product architectures vary considerably, most support the following characteristics:

    MPP. Most analytical platforms spread data across multiple nodes, each containing its own CPU, memory and storage and connected to a high-speed backplane.
    When a user submits a query or runs an application, the “shared nothing” system divides the work across the nodes, each of which processes the query on its piece of the data and ships the results to a master node that assembles the final result and sends it to the user. MPP systems are highly scalable, since you simply add nodes to increase processing power. And if the nodes run on commodity servers, as many MPP systems today do, then this scalability is more cost-effective than with MPP systems running on proprietary hardware or symmetric multiprocessing systems, which require big, expensive boxes to scale.

    Balanced configurations. Analytical platforms optimize the configuration
  • 30.
    of CPU, memory and disk for query processing rather than transaction processing. Analytical appliances essentially “hard wire” this configuration into the system and don’t let customers change it, whereas analytical bundles or analytical databases (i.e., software-only solutions) allow customers to configure the underlying hardware to match unique application requirements. Analytical appliances offer convenience and ease of use while analytical databases offer flexibility.

    Storage-level processing. Netezza’s big innovation was to move some database functions, specifically data filtering functions, into the storage system using field programmable gate arrays. This storage-level filtering reduces the amount of data that the DBMS has to process, which significantly increases query performance. Many vendors have followed suit, moving various database functions into hardware. In fact, Kickfire (purchased by Teradata) runs all SQL functions on a chip to accelerate query processing.

    Columnar storage and compression. Many vendors have followed the lead of Sybase, Sand Technology, ParAccel and other columnar pioneers by storing data in columns, not rows. Since most queries ask for a subset of the columns in a row (e.g., those named in the “select” list and “where” clause) rather than all of them, storing data in columns minimizes the amount of data that needs to be retrieved from disk and processed by the database, accelerating query performance. In addition, since data elements in many columns are repeated (e.g., “male” and “female” in the gender field), column-store systems can eliminate duplicates and compress data volumes significantly, sometimes as much as 10:1.
    This enables more data to fit into memory, which speeds processing, and minimizes the amount of disk required to store data, making the systems more cost-effective.

    Memory. Many analytical platforms make liberal use of memory caches to speed query processing. Some products, such as SAP HANA and QlikTech’s QlikView, store all data in-memory, while others store recently queried results in a smart cache so others who need to retrieve the same data can pull it from memory rather than from disk. Given the growing affordability of memory and the widespread deployment of 64-bit operating systems, which lift constraints on the amount of data that can be held
  • 31.
    in memory, many analytical platforms are expanding their memory footprints to speed query processing.

    Query optimizer. Analytical platform vendors invest a lot of time and money researching ways to enhance their query optimizers to handle various workloads. A good query optimizer is perhaps the biggest contributor to query performance. In this respect, the older vendors with established products have an edge.

    Plug-in analytics. True to their name, many analytical platforms offer built-in support for complex analytics. This includes complex SQL, such as correlated subqueries, as well as procedural code implemented as plug-ins to the database. Some vendors offer a library of analytical routines, from fuzzy matching algorithms to market-basket calculations. Some, like Aster Data (now owned by Teradata), provide native support for MapReduce programs that are called using SQL.

    Hadoop and NoSQL. Some may question whether Hadoop and the nonrelational databases are analytical platforms. While they don’t store data in rows and columns, both are well-suited to process large volumes of data for analytical purposes. Most use an MPP architecture that scales out on commodity servers, and some, such as MarkLogic, are full-fledged databases that support transactional integrity.

    Hadoop in particular differs significantly from most analytical platforms. As a batch system, it’s not focused on optimizing query performance like other analytical platforms and thus does not implement many of the characteristics listed above. However, Hadoop’s biggest value is that it’s open source and so can process large volumes of data in a cost-effective way.
    And like many nonrelational systems, it is schema-less, giving administrators greater flexibility to change data structures without having to spend weeks or months rewriting a data model.
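The shared-nothing MPP pattern described in this section (each node works on its own slice of the data, and a master node assembles the partial results) can be illustrated with a small sketch. This is only a toy model under stated assumptions: the "nodes" are worker threads in a single Python process, the round-robin partitioning and the names `partition`, `node_query` and `master_query` are hypothetical, and no vendor's actual implementation is implied.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, node_count):
    """Spread rows across nodes round-robin (an assumption; real systems
    typically distribute rows by hashing a key)."""
    return [rows[i::node_count] for i in range(node_count)]


def node_query(local_rows):
    """Work done independently by one node: total sales per store,
    computed only over that node's own slice of the data."""
    totals = Counter()
    for store, amount in local_rows:
        totals[store] += amount
    return totals


def master_query(rows, node_count=4):
    """Scatter the query to every node, then merge the partial results."""
    slices = partition(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_query, slices))
    final = Counter()
    for partial in partials:
        final.update(partial)  # Counter.update adds the partial sums
    return dict(final)


# Hypothetical point-of-sale rows: (store, sale amount)
sales = [("store_1", 10), ("store_2", 5), ("store_1", 7),
         ("store_3", 2), ("store_2", 1)]
result = master_query(sales, node_count=3)
print(result)
```

Because each node touches only its own slice, adding nodes shrinks the work per node, which is the scalability argument made in the MPP description; the merge step on the master stays small as long as the partial results are aggregates rather than raw rows.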
  • 32.
    Profiling the Use of Analytical Platforms

    NOW THAT WE understand the business context for analytical platforms, the technical architecture in which they run and their technical characteristics, we can profile their use in user organizations. To do this, I conducted a survey of BI professionals and asked them to describe their use of analytical platforms from a business and technical perspective. The survey provided respondents with the following definition of an analytical platform:

        An analytical platform is a data management system optimized for query processing and analytics that provides superior price-performance and availability compared with general-purpose database management systems.

    Given this definition, almost three-quarters (72%) of our survey respondents said that they had purchased or implemented an analytical database.

    While the growth of the analytical platform market has been strong, this 72% figure seems a tad high, given that a majority of analytical database products have been on the market for less than five years. Upon closer investigation, despite our definition, a sizable number of respondents, when asked to name their analytical platform, identified a general-purpose database, in particular Microsoft SQL Server and Oracle (non-Exadata). Regardless, the data still shows that many companies are turning to specialized analytical platforms to better meet their analytical requirements.

    Non-customers. Among respondents that haven’t purchased an analytical platform, 46% have no plans to do so, 42% are exploring the idea and just 12% are currently evaluating vendors.
    On the whole, about 75% of respondents will have an analytical platform in the near future (see Figure 10, page 33).
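The report does not show how the 75% figure is derived. One reading that reproduces it, offered here only as an assumption, is that "near future" counts the 72% who already have a platform plus the non-customers who are currently evaluating vendors. A short worked calculation under that assumption:

```python
# Percentages taken from the survey text above.
have_platform = 72        # % of all respondents who already have a platform
evaluating_of_rest = 12   # % of non-customers currently evaluating vendors

# Assumption: only current evaluators are counted as "near future" adopters.
non_customers = 100 - have_platform                            # 28% of all respondents
evaluating_overall = non_customers * evaluating_of_rest / 100  # share of all respondents
near_future_total = have_platform + evaluating_overall

print(round(near_future_total, 1))
```

Under this reading, the evaluators add a little over three percentage points to the installed base, which is consistent with "about 75%". Counting the respondents who are merely exploring the idea would give a much higher figure, so they were presumably excluded.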
  • 33.
    Figure 10: Do you plan to purchase or implement an analytical platform?
    (Asked of respondents who don’t yet have an analytical platform)
    [Bar chart. Categories: No plans, Exploring, Currently evaluating. Horizontal axis: 0 to 50.]

    DEPLOYMENT OPTIONS

    Our survey grouped analytical platforms into four major categories to make it easier to compare and contrast various product offerings:

    1. Analytical databases: They can be described as software-only analytical platforms that run on a variety of hardware that customers purchase. Customers install, configure and tune software, including the analytical database, before they can use the analytical system. Most MPP analytical databases, columnar databases and in-memory databases listed in Table 1 qualify as analytical databases.

    2. Analytical appliances: These are hardware-software combinations designed to support ad hoc queries and other types of analytical processing. Analytical appliances tightly integrate the hardware and software, often using proprietary components, to optimize performance and minimize the need for tuning. Analytical bundles, which consist of standalone hardware and software products that a vendor ships as a package, also qualify as analytical appliances. Bundles give administrators more flexibility to tune the system but sacrifice deployment speed and manageability.

    3. Analytical services: Rather than deploy an analytical platform in a customer’s data center, an analytical service enables customers to house the system in an off-site hosted environment or public cloud. This eliminates up-front capital expenditures and lessens maintenance.

    4. File-based analytical system: This generally refers to Hadoop, but we also
  • 34.
    lumped NoSQL or nonrelational systems into this category, although it’s not entirely accurate, since nonrelational systems are databases. However, since both are used to store and analyze large volumes of unstructured data and don’t require an up-front schema design, they share more similarities than differences.

    Figure 11: Analytical platform deployment options: Which have you purchased and implemented?
    [Bar chart. Categories: Analytical appliance, Analytical database, File-based analytical system, Analytical service. Horizontal axis: 0 to 50.]

    Given these categories, most analytical platform customers have either purchased or implemented analytical databases (46%) or analytical appliances (49%). Far fewer have implemented a file-based analytical system (10%) or analytical service (5%). (See Figure 11.)

    Looking under the covers, analytical database customers are most likely to have purchased Microsoft SQL Server or Oracle, while appliance customers have purchased Teradata Active DW, a Teradata Appliance, or Netezza. Analytical services customers subscribed to a host of different vendors, while customers of file-based analytical systems were most likely to purchase a Hadoop distribution from Cloudera, Apache or EMC Greenplum.

    DEPLOYMENT STATUS

    Drilling into each category further, we find that most of the respondents who have purchased an analytical platform of some type have also deployed the system. Roughly three-quarters of customers with analytical databases (73%) and slightly more customers of analytical appliances (80%) have deployed their systems.
    Not surprisingly, 100% of analytical services customers have deployed their systems, but only 33% of customers with file-based analytical systems have implemented theirs (see Table 2, page 35).
  • 35.
    Table 2: Status of deployment options

                              ANALYTICAL   ANALYTICAL   ANALYTICAL   FILE-BASED
                              DATABASE     APPLIANCE    SERVICE      ANALYTICAL SYSTEM
    Percentage deployed       72%          81%          100%         33%
    Average years deployed    4.0          4.9          3            1.3

    With an analytical service, you simply create a data model (which in many cases is optional) and load your data either by using the Internet or shipping a disk to the provider, and the provider takes care of the rest. Thus, it’s much easier to deploy an analytical service than the other options, accounting for the 100% deployment figure in Table 2.

    Analytical appliances generally take less time to deploy than analytical databases, which may account for the slightly higher deployment percentage. Analytical databases require customers to purchase and install hardware, which may take many months and require multiple sign-offs from the IT, legal and purchasing departments. Despite overwhelming press coverage of Hadoop, few companies have implemented the system. Among those that have, most are largely experimenting, which explains the low deployment percentage compared with the other options.

    The figures for “average years deployed” tell a similar story. As the new kid on the block, Hadoop systems have only been deployed for an average of 1.3 years, followed by analytical services, which have been deployed an average of three years. In contrast, analytical appliances have been deployed for 4.9 years and analytical databases for 4.0 years.
    TECHNICAL DRIVERS

    When examining the business requirements driving purchases of analytical platforms overall, three rise to the top: “faster queries,” “storing more data” and “reduced costs.” These requirements are followed by “more complex queries,” “higher availability” and “quicker to deploy.” This ranking is based on summing the percentages of all four deployment options for each requirement (see Figure 12, page 36).

    More important, Figure 12 shows that customers purchase each deployment
  • 36.
    Figure 12: Business requirements by deployment option
    (Sorted from most to least for the percentage total of all four deployment options)
    [Bar chart. Requirements shown: Faster queries, Stores more data, Reduced costs, More complex queries, Higher availability, Quicker to deploy, Easier maintenance, Faster load times, More diverse data, More flexible schema, More concurrent users, Built-in analytics. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system. Horizontal axis: 0 to 100.]
  • 37.
    option for different reasons. Analytical database customers value “quicker to deploy” (46%), “built-in analytics” (43%) and “easier maintenance” (41%) more than other requirements, while analytical service customers favor “storing more data” (67%), “high availability” (67%), “reduced costs” (56%) and “more concurrent users” (56%). Not surprisingly, customers with file-based systems look for the ability to support “more diverse data” (64%) and “more flexible schemas” (64%), two hallmarks of a Hadoop/NoSQL offering.

    Analytical appliance customers had the most emphatic requirements. Roughly two-thirds value faster queries (70%), more complex queries (64%) and faster load times (63%), suggesting that analytical appliance customers seek to offload complex ad hoc queries from data warehouses.

    This is exactly the reason that Blue Cross Blue Shield of Kansas City purchased a Teradata Data Warehouse Appliance 2650 and MicroStrategy analysis tools. The company plans to push about 1 TB of data from its IBM DB2 data warehouse to the Teradata appliance to support a self-service BI environment for about 300 users. “We want executives and managers to be able to get data and make decisions without depending on the IT department,” says Darren Taylor, vice president of enterprise analytics and data management. Taylor said the key requirements for the system were query performance and the ability to support complex analytical models and advanced visualization techniques, which will be embedded in the self-service solution.
    Many companies also offload analytical processing to analytical databases. For example, a large U.S. retailer recently offloaded complex analytical queries from its maxed-out Teradata data warehouse to ParAccel, a high-performance columnar database. The company chose a software-only system so it could implement the database in a private cloud and spawn new instances in response to user demand, a key requirement that an analytical appliance does not support. The customer also implemented a direct connection between the two systems using Teradata’s parallel FastExport wire protocol, eliminating the need for the customer to expand its ETL footprint, saving considerable time
  • 38.
    and money.

    “Agility and interoperability with existing technologies were key drivers for the customer,” said Rick Glick, vice president of customer and partner development at ParAccel.

    Selection by category. We also asked respondents if they were looking for a specific deployment option when evaluating products (see Figure 13). Except for customers of file-based systems, most customers investigated products irrespective of their technology category. For example, Blue Cross Blue Shield of Kansas City looked at three columnar databases (i.e., software-only) and an appliance before making a decision. Interestingly, no analytical service customers intended to subscribe to a service prior to evaluating products. That’s because many analytical service customers subscribe to such services on a temporary basis, either to test or prototype a system or to wait until the IT department readies the hardware to house the system. Some of these customers continue with the services, recognizing that they provide a more cost-effective test and development environment than an in-house system.

    Figure 13: Were you explicitly looking for [this deployment option]?
    (Percentages based on respondents who answered “Yes”)
    [Bar chart. Categories: Analytical database, Analytical appliance, Analytical service, File-based analytical system. Horizontal axis: 0 to 60.]

    BUSINESS APPLICATIONS

    When push comes to shove, the value of an analytical platform is judged not by its technical merits, but by the business applications it supports or makes possible.
    The most popular business applications running on analytical platforms are customer analytics, followed by management reports, financial analytics, data integration, executive dashboards and risk analytics. This ranking is
  • 39.
    based on summing the percentages of all four deployment options for each application (see Figure 14).

    Figure 14: Business applications by deployment option
    (Sorted from most to least for the percentage total of all four deployment options)
    [Bar chart. Applications shown: Customer analytics, Management reports, Financial analytics, Data integration, Executive dashboards, Risk analytics, Web traffic analytics, Supply chain analytics, Logistics analytics, Cross-sell, Social media analytics. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system. Horizontal axis: 0 to 100.]
  • 40.
    Figure 15: ROI by deployment option
    [Bar chart. Categories: Analytical database, Analytical appliance, Analytical service, File-based analytical system. Horizontal axis: 0 to 50.]

    Figure 14 also exposes stark differences in the business applications supported by each deployment option. For example, an analytical appliance is more likely to be used for customer analytics (80%), risk analytics (40%), and cross-sell recommendations (29%) than analytical databases, which are more likely to be used for management reports (82%) and executive dashboards (60%). Thus, analytical databases are more likely to be used for traditional top-down reporting, while analytical appliances are used for bottom-up analytics. This contrast makes sense when you remember that many of our analytical database users are customers of Microsoft SQL Server and Oracle 10, which are best-suited to reporting, not analytics. Figure 14 also shows that file-based systems are twice as likely as the other options to be used for Web traffic analysis (46%) and social media analysis (15%).

    ROI. Not surprisingly, given their emphasis on analytics versus reporting, analytical appliances (35%) have a higher ROI than analytical databases (26%), with analytical services close behind at 33%. Given their newness, file-based systems delivered a surprisingly strong 25% ROI, but that’s probably because most file-based systems are open source and don’t require an up-front investment in software (see Figure 15).

    TECHNICAL ATTRIBUTES

    Applications and users. When examining business attributes of each deployment option, it’s clear that analytical appliances support far more applications and users than analytical services, analytical databases or file-based systems.
The analytical appliance figure is perhaps skewed by the high
  • 41.
    number of Teradata Active DW customers who responded to the survey. Teradata Active EDW is geared to supporting multiple workloads, serving as a data warehouse, data mart and operational data store. In addition, its customers have used the product for many years, and the longer a product is used, the more applications it tends to support (see Table 3).

    Table 3: Applications and users

                                          Analytical   Analytical   Analytical   File-based
                                          database     appliance    service      analytical system
    Average number of applications        5.9          11.3         8.0          4.9
    Average number of concurrent users    47.1         81.4         27.5         27.8

    Data volumes. Analytical appliances and file-based systems are neck and neck in terms of the amount of data they store. More than 40% of both sets of customers use the systems to store between 10 TB and 100 TB of data, and more than 14% of both sets store over 100 TB. In contrast, 40% of analytical database customers have less than 1 TB of data (see Figure 16, page 42).

    Types of data. Not surprisingly, the analytical database and analytical appliance, both of which rely on relational technology, primarily hold structured data (90% and 95%, respectively). In contrast, analytical services and file-based analytical systems hold a more balanced mix of data types. More than three-quarters (78%) of analytical services customers manage structured data, while 67% manage semi-structured data and 33% unstructured data. File-based systems, meanwhile, have more semi-structured data (73%) than either structured (67%) or unstructured (33%).
This high percentage reflects the trend of companies insourcing Web data from service bureaus so they can
  • 42.
    combine it with other corporate data, such as sales and orders, and derive more value from it (see Figure 17).

    Figure 16: Volume of raw data [bar chart; categories: less than 100 GB, 100 GB to 1 TB, 1 TB to 5 TB, 5 TB to 10 TB, 10 TB to 100 TB, 100 TB+; series: analytical database, analytical appliance, analytical service, file-based analytical system; horizontal axis 0 to 100]

    Figure 17: Types of data [bar chart; categories: structured, semi-structured, unstructured; series: analytical database, analytical appliance, analytical service, file-based analytical system; horizontal axis 0 to 100]
  • 43.
    ARCHITECTURE

    Architectural roles are fairly consistent across deployment options. The noticeable exception is that analytical appliances and analytical services are much more likely to be used for data warehouses than the other options. Also, analytical services and file-based systems are more likely to be used for prototyping than other systems, and analytical databases are more likely to be used as independent data marts (see Figure 18).

    Figure 18: Architecture by deployment option [bar chart; categories: staging area, data warehouse, dependent data mart, independent data mart, development/test, prototyping; series: analytical database, analytical appliance, analytical service, file-based analytical system; horizontal axis 0 to 100]

    In addition, the most prominent use of file-based systems is for “prototyping” (44%), followed by “staging area” (38%) and “data warehouse” (38%). Hadoop is still in its early days, and many companies are experimenting with the technology, which explains the high percentage of prototyping applications. It is also often used to stage and process Web traffic so companies can summarize and transfer the data into the data warehouse for analysis.

    Some companies, however, aren’t moving this data into data warehouses; they are simply leaving it in Hadoop and allowing data scientists to query this “data
  • 44.
    warehouse” of semi-structured or unstructured data.

    I’m surprised by the 79% of analytical appliance customers who are using the system as a data warehouse. I believe this reflects the large percentage of Teradata Active DW customers who took the survey. But it’s not dissimilar from 2010 survey results that showed that 68% of companies were using analytical platforms in general to power their data warehouses. What I’ve discovered anecdotally is that companies that power their data warehouse with Microsoft SQL Server are more likely to replace the product with an analytical platform than augment it. Conversely, companies with scalable data warehouse RDBMSs are more likely to augment the RDBMS with an analytical platform than replace it.

    I’m also surprised that analytical databases are the leading platform for dependent and independent data marts, with 47% and 45% of customers selecting these architectural roles, respectively. This undoubtedly reflects the large number of Microsoft SQL Server customers who took the survey, but it’s also not too far out of line with our 2010 survey results.

    TECHNICAL REQUIREMENTS

    The technical requirements for selecting products varied widely by deployment option. Figure 19 (see page 45) ranks the requirements by sum of the percentages across all four deployment options.
This shows that the top technical requirements are “supports our preferred ETL/BI tools,” “automated distribution of data” and “use of commodity servers.” These are followed by “MPP,” “built-in fast loading,” “supports unstructured data,” “supports our preferred operating system” and “mixed workload.”

What’s striking is the variation in support for these requirements by deployment option. For example, interoperability with existing BI and ETL tools is a
  • 45.
    critical requirement for all options except the file-based system. This makes sense, since most Hadoop developers write custom code in Java, Perl or some other language to construct queries rather than use packaged BI tools.

    Figure 19: Technical requirements [bar chart; categories: supports preferred ETL/BI tools, automated distribution of data, use of commodity servers, MPP, built-in fast loading utilities, supports unstructured data, supports our preferred operating system, mixed workload, open source, supports MapReduce, supports our preferred hardware; series: analytical database, analytical appliance, analytical service, file-based analytical system; horizontal axis 0 to 100]
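    The ranking method behind Figure 19, summing each requirement’s percentages across the four deployment options and sorting by the total, can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the requirement names and percentages below are hypothetical placeholders, not survey results, and the function name is an assumption introduced here.

```python
# Hypothetical percentages for illustration only; these are NOT survey data.
# Each requirement maps to the share of respondents, per deployment option,
# who selected it.
HYPOTHETICAL_SCORES = {
    "Requirement A": {"database": 50, "appliance": 60, "service": 40, "file_based": 10},
    "Requirement B": {"database": 20, "appliance": 30, "service": 25, "file_based": 70},
    "Requirement C": {"database": 30, "appliance": 55, "service": 20, "file_based": 15},
}


def rank_requirements(scores):
    """Return (requirement, total) pairs sorted by the summed percentages, highest first."""
    totals = {req: sum(by_option.values()) for req, by_option in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for requirement, total in rank_requirements(HYPOTHETICAL_SCORES):
        print(f"{requirement}: {total}")
```

    Note that a summed ranking can hide the variation by deployment option. In the placeholder data, “Requirement B” ranks second overall almost entirely because of one option, which is the kind of pattern the per-option bars in the figure make visible.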
  • 46.
    However, BI and ETL vendors are extending their products to interoperate with Hadoop, so this will undoubtedly change, since it’s often easier to use tools than write code.

    Another variation is that file-based customers are much more interested in “commodity servers,” “open source” and “MapReduce” than customers of other deployment options. This makes sense, since all three requirements are critical elements of a Hadoop ecosystem. In contrast, analytical appliance customers are concerned with MPP, fast-loading utilities and mixed workload functionality. This aligns with the predominant needs of Teradata Active DW customers, who constituted a large portion of the analytical appliance respondents.

    VENDORS

    When asked why they selected their chosen vendors, respondents were most likely to say, “Met more of our requirements,” followed by “successful proof of concept,” “liked and trusted the vendor” and “support and service.” Interestingly, pricing was ranked fifth on the list, followed by customer references and vendor incumbency (see Figure 20, page 47).

    A vendor’s ability to meet more of a customer’s requirements was more important for analytical appliance customers than customers of other deployment options. This is for two reasons: (1) Many analytical appliances are new products from startup vendors (except Teradata Active DW), so customers need to make doubly sure that the product meets their requirements, since these vendors have less of a track record, and (2) the customer is spending a significant amount of money on the product, and it’s playing a central role in the data warehousing architecture.
(This is also true for Teradata Active DW customers.)

It came as no surprise that support and service are more critical for customers of analytical services, who are totally dependent on the quality and responsiveness of the vendor to meet their needs. Pricing is also clearly a bigger issue for analytical service customers than others, since price (or more likely lack of an up-front capital investment) is a major inducement to hand
  • 47.
    over responsibility for corporate data to a third party.

    Incumbency. Interestingly, incumbency can cut both ways. Blue Cross Blue Shield of Kansas City decided not to examine an appliance product from its incumbent data warehouse vendor because it didn’t want to expand its relationship with that vendor. At the same time, it considered a columnar database from Sybase because it had an existing relationship with the vendor in another area of the business.

    Figure 20: Vendor selection criteria [bar chart; categories: met more of our requirements, successful POC, liked and trusted vendor, support and service, pricing, customer references, incumbent vendor, pre-sales and sales process, other; series: analytical database, analytical appliance, analytical service, file-based analytical system; horizontal axis 0 to 100]
  • 48.
    Recommendations

    To address the information needs of the modern corporation, organizations should take the following steps:

    1. Support both top-down and bottom-up business requirements. For too long, organizations have tried to shoehorn all types of users into a single information architecture. That has never worked. Organizations need to recognize that casual users, who represent a majority of their business users, primarily need top-down, interactive reports and dashboards, while power users need ad hoc exploratory tools and environments. Balancing these polar-opposite requirements in a single architecture requires new thinking.

    2. Implement a new BI architecture. The BI architecture of the future incorporates traditional data warehousing technologies to handle detailed transactional data and file-based and nonrelational systems to handle unstructured and semi-structured data. The key is to integrate these systems into a unified architecture that enables casual and power users to query, report and analyze any type of data in a relatively seamless manner. This unified information access is the hallmark of the next-generation BI architecture. More immediately, companies are using Hadoop to preprocess unstructured data so that it can be loaded and integrated with other corporate data for reporting and analysis. This allows BI and ETL users to use familiar tools to query and analyze data.

    3. Create analytical sandboxes. The new BI architecture brings power users more fully into the corporate information architecture by creating analytical sandboxes that enable them to mix personal and corporate data and run complex, ad hoc queries with minimal restrictions.
Types of analytical sandboxes include (1) virtual sandboxes running as partitions within a data warehouse, (2) a free-standing data mart running a replica of the data warehouse or other data not available in the data warehouse and powered by
  • 49.
    an analytical platform or nonrelational database, (3) an in-memory BI tool that runs on an analyst’s desktop or a corporate server and (4) a Hadoop cluster that stores atomic-level unstructured or semi-structured data.

    4. Implement analytical platforms that meet business and technical requirements. Today, organizations implement analytical platforms for various reasons. For example, analytical appliances are fast to deploy and easy to maintain. They make good replacements for Microsoft SQL Server or Oracle data warehouses that have run out of gas, and they are ideal as free-standing data marts that offload complex queries from large, maxed-out data warehousing hubs. Analytical databases, as software-only solutions, run on a variety of hardware platforms and are good for organizations that want to tune database performance for specific workloads or run the RDBMS software on a virtualized private cloud. Analytical services are great for development, test and prototyping applications as well as for organizations that don’t have an IT department, want to outsource data center operations or need to get up and running very quickly. File-based analytical systems and nonrelational databases are ideal for processing large volumes of Web traffic and other log-based or machine-generated data. Organizations need to carefully evaluate the type and capabilities of the analytical platform they need before purchasing and deploying a system.

    ABOUT THE AUTHOR

    Wayne Eckerson has been a thought leader in the data warehousing, business intelligence and performance management fields since 1995. He has conducted numerous in-depth research studies and is the author of the best-selling book Performance Dashboards: Measuring, Monitoring, and Managing Your Business.
He is a noted keynote speaker and blogger, and he consults and conducts workshops on business analytics, performance dashboards and business intelligence (BI), among other topics. For many years, Eckerson served as director of education and research at The Data Warehousing Institute (TDWI), where he oversaw the company’s content and training programs and chaired its BI Executive Summit. Eckerson is currently director of research at TechTarget, where he writes a popular weekly blog called Wayne’s World, which focuses on industry trends and examines best practices in the application of business intelligence. (See www.b-eye-network.com/blogs/eckerson.) Wayne is also president of BI Leader Consulting (www.bileader.com) and founder of BI Leadership Forum (www.bileadership.com), a network of BI directors who meet regularly to exchange ideas about best practices in BI and educate the larger BI community. He can be reached at weckerson@techtarget.com.
  • 50.
    RESOURCES FROM OUR SPONSOR
    • The Business Case for Transparent Databases, by Sandy Steier (White Paper)
    • BIG DATA Analytics, This time it’s personal, by Robin Bloor (White Paper)
    • AutoZone’s Cloud BI Initiative, The Endless Possibilities of Cloud BI (Case Study)
    About 1010data: 1010data is the first interactive cloud-based platform for Big Data analytics. The company’s namesake service provides the most powerful, usable and scalable solution available today for investigative and predictive analytics. It does this by combining ultra-fast database technology with a rich and sophisticated array of built-in analytical functions and an intuitive worksheet user interface, and delivers them as a managed service that offers the fastest time to value. And, with 1010data, there is no need for complex, time-consuming data design, integration, or transformation steps. The 1010data cloud also hosts and enables access to a growing number of large proprietary and public data sets, including ones for credit reporting, mortgage-backed securities, real estate, labor statistics, and more. 1010data is used by hedge funds, global banks, large securities exchanges, top retailers, leading consumer packaged goods companies, and many other industry leaders to manage, manipulate and monetize trillions of business data records every day.
  • 51.
    RESOURCES FROM OUR SPONSOR
    • LiveRail Implements Infobright and Hadoop for Video Advertising
    • What’s Cool About Columns...and How to Extend Their Benefits
    • Infobright Analytic Database: Architecture Overview
    About Infobright: Infobright develops and markets a high performance, self-tuning analytic database designed for applications and data marts that analyze Big Data, especially “machine-generated data” such as web data, network logs, telecom records, stock tick data and sensor data. Easy to implement and with unmatched data compression, operational simplicity and low cost, Infobright is being used by enterprises, SaaS and software companies in online businesses, telecommunications, financial services and other industries to provide rapid analysis of critical business data.
  • 52.
    RESOURCES FROM OUR SPONSOR
    • Big Data Unleashed: Turning Big Data into Big Opportunities White Paper
    • ESG Report: Informatica 9.1 and Integrating Big Data
    • Integrating Social Media into MDM Demo
    About Informatica: Informatica is the world’s number one independent provider of data integration software. Thousands of organizations turn to Informatica to gain a competitive advantage in today's global information economy with timely, relevant and trustworthy data for their top business imperatives. Enterprises rely on Informatica Data Integration and Data Quality solutions to gain a competitive advantage from their information assets to grow revenues, increase profitability, further regulatory compliance and foster customer loyalty. The Informatica Platform provides a comprehensive, unified and open approach to lowering IT costs and gaining competitive advantage from data held in the traditional enterprise and in the Internet cloud.
  • 53.
    RESOURCES FROM OUR SPONSOR
    • The Post-Relational Reality Sets In: 2011 Survey on Unstructured Data
    • Leading Analyst Predicts Big Changes from Big Data: Exclusive Interview Recording
    • Addressing the Challenges of Unstructured Information with Purpose-built Technology
    About MarkLogic: MarkLogic empowers organizations to make high stakes decisions on Big Data in real time. Customers trust MarkLogic for mission critical applications that drive revenue and growth through Big Data Analytics enabled by MarkLogic products, services, and partners. MarkLogic is a fast growing enterprise software company that has been providing solutions to the public sector and Global 1000 for nearly a decade. Operating at petabyte scale, MarkLogic Server is a next generation database for unstructured information that allows customers to outflank their competition by consistently getting to better decisions faster. MarkLogic is headquartered in Silicon Valley with field offices in Austin, Boston, Frankfurt, London, Tokyo, New York, and Washington D.C.
  • 54.
    RESOURCES FROM OUR SPONSOR
    • ParAccel Analytic Platform Datasheet
    • ParAccel Video: Trends and Architectures for Modern Analytics
    • ParAccel Deployment Options: On-Premise and in the Cloud
    About ParAccel: ParAccel's analytic platform is built from the ground up to provide the highest performance for the widest variety of analytic workloads. It includes 500+ advanced in-database functions along with dynamic analytic and data integration. ParAccel's software-only approach enables physical and virtual deployment, either on-premise or in public/private clouds.
  • 55.
    RESOURCES FROM OUR SPONSOR
    • How to rapidly deploy BI applications across the Enterprise
    • Agile BI deployment for Enterprise
    • Why simple scalability is the key to Big Data
    About SAND: SAND is the world's most advanced analytic database, managing massive amounts of big data, driving unparalleled performance, and deploying information to tens of thousands of concurrent users across the enterprise. With industry-leading software solutions for CRM and Loyalty, and having achieved "Certified for SAP NetWeaver" status and "Powered by SAP NetWeaver" status, SAND delivers best-of-breed analytic performance to over 600 customers around the world. SAND Technology has offices in the United States, Canada, Western and Central Europe, and Australia and can be reached online at www.sand.com.
  • 56.
    RESOURCES FROM OUR SPONSOR
    • Leveling the Playing Field: How Companies Use Data for Competitive Advantage
    • The Intelligence Future: Simple, Seamless, Social
    • SAP HANA: Helping Businesses Run Better - in Real-Time
    About SAP: SAP's vision is for companies of all sizes to become best-run businesses. Best-run businesses transform rigid value chains into dynamic business networks of customers, partners, and suppliers. They close the loop between strategy and execution, help individuals work more productively, and leverage technology for sustainable, profitable growth. This vision is in keeping with SAP's mission to accelerate business innovation for companies and industries worldwide, contributing to economic development on a grand scale.
  • 57.
    RESOURCES FROM OUR SPONSOR
    • The Transition Layer: The Role of Analytical Talent
    • Nine Data Prep Lessons for Advanced Analytics
    • Operational Analytics: Putting Analytics to work in Operational Systems
    About SAS: SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions delivered within an integrated framework, SAS helps customers at more than 50,000 sites improve performance and deliver value by making better decisions faster. Since 1976 SAS has been giving customers around the world The Power to Know®.