Option to talk about EMC's acquisition of Greenplum here. Key points:

EMC were the first to start talking about "big data". If appropriate, tell the story of the term "Big Data": I remember when Gartner first started publishing articles and papers on what they called "Extreme Data" – this was in the days when everything was "extreme". There was extreme sports (people jumping out of airplanes on surfboards), extreme performance; in IT people were talking about "extreme programming" (which became Agile), and Gartner talked about Extreme Data. Joe Tucci of EMC had already been talking about what he called Big Data, and saw that Gartner were calling the same concept Extreme Data. Joe didn't like the term Extreme – it didn't accurately depict what we were talking about – so he took a gamble and stuck with the term Big Data. Eventually Gartner and the analysts stopped talking about Extreme Data and started using the EMC term, Big Data. [This story is useful as it carries the meta-message of EMC as thought leaders.]

EMC were the first movers in acquiring a Big Data Analytics platform. We looked at Netezza (runs on proprietary hardware, can't be virtualised); Teradata (too big, too expensive, too established to pivot their business model); Vertica (good for some use cases but not for unpredictable ad-hoc queries and deep atomic-level analysis); AsterData (only addresses part of the 'big data' challenge); and a number of others – but there were some key reasons why Greenplum was the standout selection for a serious Big Data Analytics strategy.
Simple definition of Big Data. Some analysts talk about a 4th attribute, "complexity" – however, the complexity is really about the more complex queries and analytics we run against these data sets, not an attribute of the data itself. We'll be looking at some examples of what each of these three terms means.
Big Data is massive new data volumes.

This is a typical Australian electricity bill. How often do I get one of these? (guess) – every 3 months. Why? Because a person has to physically walk up to the meter and read it – very manual, and it can only be done a few times a year. The electricity company has a data warehouse which captures all their billing data; they use it to analyse usage patterns across parts of the network (and not much else). Their data warehouse might be 3 terabytes. Not huge.

This is a smart meter. The smart meter provides readings directly to the utility via wireless or mobile phone networks – every 5 minutes. So instead of one reading per customer every three months, we can access a record per customer every 5 minutes. The data has just grown 3000 times. (That's big data.) So suddenly the utility, to retain the same depth of customer data, needs 9 petabytes of data warehouse. Most of it becomes exhaust data – but there is tremendous value in this data if you can keep it and analyse it:
Network load analysis over time
Better decisions on where to increase network capacity
Real-time alerts when a particular power node is approaching saturation
You can also provide the information back to your customers…
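The scaling arithmetic above can be sketched quickly (the 5-minute cadence comes from the slide; the 91-day quarter and the alternative half-hourly cadence are illustrative assumptions):

```python
def reading_growth(interval_minutes: int, days: int = 91) -> int:
    """Readings per customer over one ~3-month billing quarter at the
    given cadence, versus a single manual read per quarter."""
    return (24 * 60 // interval_minutes) * days

# At a 5-minute cadence, one quarterly read becomes tens of thousands:
print(reading_growth(5))    # 26208
# Even a half-hourly cadence multiplies the data by thousands:
print(reading_growth(30))   # 4368
```

The exact multiplier depends on the cadence and retention you assume, but the order of magnitude – thousands of times more rows per customer – is what pushes a small warehouse into petabyte territory.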
This is the SilverSpring web site, a screenshot taken about 2 weeks ago. Notice the promise to the consumer – "See your energy use in real time". You only want to promise that if your database can perform – can handle huge numbers of ad-hoc queries from consumers accessing your website 24x7. SilverSpring use Greenplum to capture all the readings from their smart meters into a database, and they make this data available to their customers.
An example of what consumer-facing real-time electricity usage looks like.
But it's not just for the consumers – the real value for the utility is what they can do with the data themselves:
Predictive maintenance
Usage trends over time, down to the suburb and street level
Geo-spatial mappings over streets, looking for weather-related incidents, maintenance cost anomalies and so on.
(I worked with a utility in Sydney where, using their data warehouse, they were able to identify some motors and pumps that were starting and stopping several times an hour, and others that only cycled once or twice a day or week – so instead of blindly sending around a maintenance crew every 3 months, they could maintain some pumps every month and other pumps only twice a year. Savings of several $M.)
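The Sydney pump story above is essentially grouping assets by observed cycle frequency; here is a minimal hypothetical sketch (the pump IDs, event data and banding thresholds are all made up for illustration):

```python
from collections import Counter

# Hypothetical start events over a 30-day month: (pump_id, hour_of_start)
events = [("pump_A", h) for h in range(0, 720, 6)]     # cycles every ~6 hours
events += [("pump_B", h) for h in range(0, 720, 168)]  # cycles roughly weekly

starts_per_month = Counter(pump for pump, _ in events)

def maintenance_interval(starts: int) -> str:
    """Illustrative banding rule: busy pumps get serviced more often."""
    if starts > 100:
        return "monthly"
    if starts > 10:
        return "quarterly"
    return "twice yearly"

for pump, starts in sorted(starts_per_month.items()):
    print(pump, starts, maintenance_interval(starts))
```

In practice the event data would come out of the warehouse itself and the bands would be set by the maintenance engineers, but the shape of the analysis is this simple.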
(an example of variety) Here's the web site of an Australian bank. Every single click on this website is captured in a Greenplum database. Based on how users navigate through the site – what works, what doesn't, which ads attract more clicks – the marketing team manage which ads appear where, moving things around to test how customers respond and looking for the best response. They do this every 24 hours. This example was presented in 2011 to the Australian Institute of Analytics Professionals as an example of real business use for web-click analytics.
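The daily rollup the marketing team works from can be sketched in a few lines (the column names and sample rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical click-stream rows captured from the site: (day, slot, ad_id)
clicks = [
    ("2011-05-01", "home_top", "ad_42"),
    ("2011-05-01", "home_top", "ad_42"),
    ("2011-05-01", "home_side", "ad_17"),
    ("2011-05-02", "home_top", "ad_17"),
]

# Tally clicks per day, page slot and ad – the kind of 24-hourly rollup
# used to decide which ads appear where:
tally = defaultdict(int)
for day, slot, ad in clicks:
    tally[(day, slot, ad)] += 1

for key in sorted(tally):
    print(key, tally[key])
```

In production this would be a GROUP BY over billions of rows in the database rather than an in-memory loop, but the aggregation is the same.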
(an example of velocity) Every trade on the NYSE is captured in a Greenplum database – 300,000 transactions per second. Analysts then have algorithms running on this data in real time, looking for patterns that suggest fraud, such as insider trading. This analysis requires atomic-level data (no summarisation – every trade) and many months of history to find the patterns they are looking for. The market regulator, FINRA, is also a Greenplum customer. They aggregate the trade data from many bourses, including Arca, NYSE Amex, NASDAQ, Euronext, and the International Securities Exchange (ISE). All these sources are aggregated, and FINRA does the analysis across all of them, looking for more sophisticated insider trading and other fraudulent activity that may be hidden across several exchanges.
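A quick sketch of why this is a velocity problem – the volumes implied by 300,000 transactions per second (a sustained rate is assumed here purely for illustration; real exchange load is bursty):

```python
# One 6.5-hour NYSE trading session at the quoted rate:
TPS = 300_000
TRADING_SECONDS_PER_DAY = int(6.5 * 3600)

trades_per_day = TPS * TRADING_SECONDS_PER_DAY
print(f"{trades_per_day:,} trades per session")   # 7,020,000,000

# Keeping "many months" at atomic level, e.g. ~126 sessions (6 months):
retained = trades_per_day * 126
print(f"{retained:,} trades retained")            # 884,520,000,000
```

Hundreds of billions of atomic rows kept online for pattern analysis is exactly the workload a traditional warehouse struggles with.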
The purpose of this slide is to establish that it's not (just) EMC saying that data warehousing needs to change – this comes from Ralph Kimball, one of the fathers of data warehousing. Here are some examples of the kind of analytics that can be run against different types of data, and the kind of insights you can expect to gain. His point is that traditional data warehousing architectures don't cater for these types of analytics.
This is traditional data warehousing – it has basically not changed in 20 years. ("In the past, when I was consulting and advising in Information Management strategy, I used to use this fact to reassure the client – we are treading a well-worn path here, all the mistakes have been made, this is not new, it's been around for 20 years. However, the time has come for this approach to be revisited, as it's not flexible and agile enough to meet accelerating rates of business change and increasing data complexity and volumes.")

On the left we have the "source systems" – usually transaction systems, eg SAP, Oracle apps, Siebel, CRM. (In insurance you would have a policy system – or more than one – and one or more claims systems. In a bank you would have the core banking transaction system, plus mortgage systems, credit card systems, margin lending, and so on.) To get a "single view of customer" you have to bring all this data together into one data model.

Talk about Bill Inmon, "father of data warehousing", and the Corporate Information Factory – the ideal world in which all information assets across the organisation are brought together into a comprehensive Enterprise Data Model, and then any question about any aspect of the business can be answered by this magic data model.

So: (click) We transform the data and conform it all into a Consolidated Data Repository or Integrated Data Model (or Enterprise Data Model). These are very complex data models that can take years to design and build. Often a company will buy one from IBM or Teradata because it takes too long to design your own – cheaper to get one off the shelf. BUT then you have to integrate all your data sources into the data model – a lot of work. One bank in Singapore spent $10M to buy one of these data models, then had to spend another $14M integrating their data sources into it. (Big banks, insurers and other companies had good success with this approach – 20 years ago.
And the approach has remained pretty much unchanged in 20 years.)

But now the data model is so complex that you can't have the business people working with it directly – so you create a data mart layer with simpler data models, and then expose these data marts to the analysts through Business Intelligence tools like Cognos, Business Objects, MicroStrategy, Tableau, QlikView and so on. But the pace of business has changed – and data warehousing has not kept up. This creates a lot of pain in the world of information management. I think I can summarise the pain of data warehousing in a few points. (Note to presenter: depending on the audience and the time available, the following content has to be filtered.)

First, it's expensive. (We know that, but that fact drives certain behaviours that have a detrimental effect on the ability to get value from this asset.) A typical project in data warehousing starts from a few hundred thousand dollars and ranges up to the millions. (We used to joke: when the business comes to us with a question, the answer is always the same – 6 months and half a million dollars. That's how long it will take to enhance the EDW to answer the question.) It's expensive in terms of raw hardware costs and licensing – often a multi-million dollar investment to kick off, then ongoing annual licence fees for support. It's expensive to "feed and water" – the resources needed to troubleshoot data loads, create and manage partitions and indexes, tune queries and so on. And it's expensive to develop and enhance the data model – a typical BI project on an Enterprise Data Warehouse starts at a few hundred thousand dollars and 3+ months of design, development and testing, and in a large site can easily run to millions or tens of millions.
(Cite an example or two – we are talking with a bank at the moment about a risk project in response to new requirements from the regulator; they are already throwing around figures as high as $80 million – and this does not create any new data, it just reports on data they already have.)

Because of this cost, the usefulness of the DW is constrained. Often, data is summarised after a certain point because we can't use too much storage – it's too expensive. (When I worked at a telco, we would keep two months of atomic-level data, and then an end-of-month process would summarise the third month into a much smaller data set – meaning we lost a huge amount of valuable detail in the call data records. Why did we do this? The DW wasn't big enough to keep the data, and getting more TB was just too expensive.) So we compromise on the quality and depth of the data we keep and the analysis we can provide, because of this cost factor.

Analytics projects often need temporary space on the DW to work with subsets of data – in our case there were always projects competing for space. Marketing campaigns would have to reserve a few hundred GB between specified dates, and if another project ran over its end date there would be a fight over who gets the space. It often came down to raw dollar impact: if I kill the project now to give you the space, the impact is x million dollars; but if marketing don't run their campaign, they lose y dollars in sunk cost and z million dollars in opportunity cost… and so on. Why couldn't we just get more space? It was too expensive and took too long.

Finally, the cost of the DW meant that it was rigidly protected. Access to the data required certain authorisations – sometimes even just running a query needed certain permissions – and it slowed everyone down.
So if an analyst had a great idea about running some scenarios over a market segment, they had to decide whether it was worth the effort of getting permission to run the query (that could take half a day); then once they'd run it they would want to tweak it a little here and there, resubmit, tinker a bit, resubmit, and on and on. So the usefulness and value to the business of this great investment was constrained by the very fact that it was such a huge investment.

Second, enhancements and new projects take far too long. Time to value is just too great – often by the time a project is completed, the business has moved on and the requirements have changed. One of the reasons is the use of rigid, sometimes purchased, industry or enterprise data models. New data sources needed for a new reporting subject area (whether bringing risk data into the enterprise financial data model, or a new customer information source into the marketing data mart) have to be mapped into the enterprise data model; then ETL jobs need to be designed, developed and tested; then there needs to be impact analysis on existing reports to see if any are affected, and if so, another project to remediate those reports so that the business is not disrupted when the new project goes live. All this is not just expensive – it takes a long time.

Third, the barriers to easy access to the protected DW drive analysts and users to hive off and create their own data sets to which they have unrestricted access. When it's hard to get access to a dataset, or hard to get some space in the EDW for a sand-box, the user community will keep copies of the data that matters most to them in local data stores in Excel or Access, or maintain their own data marts in SQL Server or Oracle. These data sub-sets are irregularly refreshed, not subject to quality controls, and not auditable.
Yet they are often used to derive figures that go to the board or to regulatory authorities, or to manage a whole reporting area such as Risk – and the numbers from these data repositories are passed to report assemblers who put these figures into monthly board reports, annual reports or regulatory reports (for example). The whole purpose of the data warehouse was to retire these fragmented data repositories and keep the data centralised where its quality can be assured, and yet the difficulty of getting to the data in the controlled environment drives the user community to bypass those very safeguards. And… all of these problems are about to get worse! (transition to next slide)
(10 minute mark) (so where we were just managing or surviving with the data we already have, we’re about to be deluged with a tsunami of Big Data.)
Big Data will revolutionise DW and analytics. In summary, these are the challenges that traditional data warehousing faces from Big Data and Big Data Analytics.
There's a lot on this slide and you can't talk about all of it. The purpose of the slide is to impress on the audience that we (EMC) have done a lot of deep and comprehensive thinking on how Information Management is going to change as a result of Big Data. Big Data doesn't need to be a threat – there is a journey from where we are to where we need to go, there are companies at various stages along this journey, and we are helping many of them. It's a holistic journey that isn't just about technology – it's about all of these (4) layers. Big Data and Big Data Analytics have implications for all of these aspects of an Information Management strategy. An IM strategy has to address each of them – where we are now, where we believe we need to be, and how we are going to get there. I want to take some time to look at each of these aspects and highlight what that journey looks like in some detail.
Key points – talk down the first column, then the second (not row by row). Read up on the model-less (or transformation-less) warehouse, and on data vault. This is relatively new, and there will be old-school people whose whole careers and credibility are built on the old way of doing things. You can't dismiss that – it's a both-and, not an either-or (see later slides). If it gets confrontational, offer to take it "offline" and talk in person later. The defence is that it's not EMC saying this – Kimball is saying it, the thought leaders are saying it. Kimball has a paper on this – offer to send it to the audience. If you know the "end of science" story, you can tell that as well.
This slide you can talk to row by row.
There are other points that could be added – the slide does not have all of them. Eg:
Tightly controlled data dictionary -> wiki as data asset inventory
Classic waterfall report development processes -> iterative prototypes with frequent review cycles in the UAP itself
Defined / pre-canned reports -> many one-off queries for data exploration
The scope of the traditional data warehouse will shrink to those subject areas that require and justify the heavy governance that has traditionally been used to manage highly important information that needs to be accurate to the unit or to the cent. Eg reporting to the market, to the CFO, to regulators, to the board. (There will still need to be data integration eg if a bank has credit card holders who are also mortgage holders, they need to report on total exposure. Or if a telco has pre-paid and post-paid customers, they need to report to regulators on the size of their customer base) But much of what the data warehouse is used for does not require this rigour. And almost all of what Big Data is used for does not require this rigour. So we are seeing the emergence of a different kind of platform – the Analytical Platform for Big Data.
(this slide teases out the points from the last one) Don’t spend time on this slide – it’s just a transition to introduce the Unified Analytics Platform (It’s important here not to go into detail on describing the analytical platform because there are slides following to do that – otherwise the rest of the presentation becomes redundant/repetitive)
Here’s an example of a customer who used the Analytics Platform for some analysis that couldn’t run on the EDW (as it would have interrupted important BAU reports – can’t run full table scans on the entire call record history – everything else grinds to a halt.)
(Note, this slide needs to go, the story is too old – replace with your favourite big data value story)
This slide transitions into a Greenplum-specific conversation – much of this change is internal (and we can provide advice and consulting to assist), but in the technology layer we can provide solutions.
Is BCC using something like that? How are the travel time functions determined?
Travel time function for road connection 10491 to area 10784
The growth trajectory of data has already surpassed the capability of today's databases to adequately store and efficiently process it. We are seeing a fault line developing and widening rapidly between the demand for better technology solutions to handle the data explosion and the supply of them. Final pitch – organisations can already see the writing on the wall; this survey was taken in 2011. (TDWI = The Data Warehousing Institute)
So this is what the stack looks like at a high level. There's the Greenplum database for co-processing structured, semi-structured, and unstructured data with Greenplum Hadoop. These are overlaid with a unified data access and query layer that combines the programming languages of choice (SQL, MapReduce, etc.). Over the access layer comes our partner tool and services layer. We are not about locking customers into a single tool or stack; instead we work with the tool vendor of your choice, be it SAS or R, MicroStrategy or Informatica. And sitting atop all of those technologies is the Chorus layer, which provides productivity tools to facilitate collaboration between the different stakeholders.

What sets this diagram apart from a typical vendor diagram is the inclusion of people. That is not a mistake. We have introduced the Unified Analytics Platform, but there is more to the story than technology, and I will talk more about that in a few minutes. UAP is about enabling data analytics practitioners to access and manage datasets and projects much more easily. A typical team can include the data platform administrator, data scientists, analysts, engineers, BI teams and, most importantly, the line-of-business user, and UAP shapes how they participate on this data science team. We develop, package and support this as a unified software platform available on your favourite commodity hardware, cloud infrastructure, or from our modular Data Computing Appliance.
Key to the success of the new approach to Information Management is the ability to collaborate and share knowledge within the same environment that manages the sand-pits and datasets.