This document discusses the challenges of data modeling in the era of big data and the need for data modeling tools to evolve to represent both traditional relational databases and the non-relational data stores used in big data systems. It provides an overview of how the data landscape has changed with the rise of big data and NoSQL databases. It also describes a proof-of-concept project in which CA Technologies used its data modeling tool, ERwin, to reverse engineer and represent both relational and non-relational data from its products stored in a Hadoop cluster. The document argues that a unified view of all enterprise data spread across different data systems is both needed and possible.
A unified data modeler in the world of big data
1. A Unified Data Modeler in the World of Big Data
William Luk, CA Technologies Inc
Sr Director, Software Engineering – Data Modeling
Session Code: HT01
2012
Collaboration By Design
2. Speaker Bio
— Senior Director of software development in the Data Management BU; head of ERwin engineering and level 2 support
— Experience in databases, data security, and data management
— BS & MS in CS
3. A Unified Data Modeler in the World of Big Data
Session Agenda
— Where are we & how did we get here?
— Overview of the Big Data world
— Challenges to enterprises and data architects
— Extending data modeling to include Big Data
— Q&A
4. Data Modeling: The Past 30 Years
— Entity-Relationship (ER) modeling has served us well since the mid-70s
— Data architects / modelers have used ER tools to ensure data consistency and integrity for very large enterprises
— Ability to integrate new databases from mergers and acquisitions
— A map of where all your data lives
— Ability to handle large & complex data models
— Then came the Internet & social networks
5. Internet & Social Networks
— The early Internet used the classic LAMP stack – Linux, Apache Web Server, MySQL Database, and Perl/PHP/Python
— Basic web servers & DBs served us well for basic web portals
— Internet growth + social networks changed the scale of databases / data stores
— Traditional relational databases have difficulty handling the scale & the (sub-second) response times the web requires
— Emergence of NoSQL data stores
6. Arrival of Big Data
— Wealth of valuable data to collect:
− User-entered information
− History / logs of user interactions
— This data does not always fit nicely into structured data stores (relational or NoSQL)
— Need to harvest / analyze the data to compete
— Challenges of capturing, storing, searching, analyzing, and visualizing very large and complex data sets
— Large, distributed analytical platforms (Hadoop) emerged
7. Enterprise Big Data / Hadoop Workflow
[Architecture diagram] Customer data sources feed HDFS as unstructured data / files and via HQL (Hive SQL), JSON, XML, etc. Inside the Hadoop framework (clusters), the data lands as structured data (Hive), semi-structured data (HBase, XML, JSON), and unstructured data, which the MapReduce / analytics layer (Pig, Cloudera, Datameer, etc.) then processes. A query sketch follows below.
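To make the workflow concrete, here is a minimal sketch of querying structured data that has landed in Hive on HDFS. It uses the PyHive client; the host, table, and column names are hypothetical and not taken from the deck.

from pyhive import hive  # pip install pyhive

# Connect to a hypothetical HiveServer2 endpoint on the cluster.
conn = hive.Connection(host="hadoop-gw", port=10000, username="analyst")
cur = conn.cursor()

# HQL (Hive SQL) is compiled into MapReduce jobs that scan files in HDFS.
cur.execute("""
    SELECT status, COUNT(*) AS hits
    FROM web_logs            -- hypothetical table of ingested web logs
    GROUP BY status
""")
for status, hits in cur.fetchall():
    print(status, hits)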
8. Problems of Non-Relational Data Stores
— NoSQL and unstructured data store performance has a price:
− Denormalized data
− Weaker data consistency & integrity – only "eventual" consistency is guaranteed (see the sketch below)
— Some data (such as user comments) can tolerate these drawbacks
— Some data (such as financial and transactional data) cannot
— Enterprises' conclusions:
− NoSQL & Big Data are good for business intelligence data
− Financial & transactional data still require relational databases
− Compliance requirements / regulations
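The consistency trade-off above is visible directly in client code. A minimal sketch with the DataStax Cassandra driver, assuming a hypothetical contact point, keyspace, and tables: reads at consistency level ONE are fast but may be stale – the "eventual" consistency the slide warns about – while QUORUM pays latency for stronger guarantees.

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])        # hypothetical contact point
session = cluster.connect("app_ks")    # hypothetical keyspace

# ONE: any single replica may answer – fast, possibly stale.
# Fine for data like user comments.
fast_read = SimpleStatement(
    "SELECT body FROM comments WHERE post_id = %s",
    consistency_level=ConsistencyLevel.ONE)

# QUORUM: a majority of replicas must agree – slower, but closer to
# what financial / transactional data requires.
safe_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)

rows = session.execute(fast_read, (42,))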
9. The New World of Data Modeling with Relational and Big Data
— The new enterprise data landscape:
− Different relational databases
− Distributed Hadoop clusters with structured, semi-structured, and unstructured data that is constantly changing
— Challenges to data architects / modelers:
− Identify potential relationships between different data stores
− Find an automated way to track and update the unified view
— Data modeling tools, such as ERwin, need to evolve to present a single unified view of ALL enterprise data
10. ERwin Tapping into Hadoop
[Architecture diagram] Data sources flow into HDFS as before (HQL / Hive SQL, JSON, XML, unstructured data / files). ERwin taps in by reading JSON / XML headers and issuing HQL against Hive, so that the structured (Hive), semi-structured (HBase, XML, JSON), and unstructured data in the Hadoop framework (clusters) – alongside the MapReduce / analytics layer (Pig, Cloudera, Datameer, etc.) – can be represented in the model. A metadata-extraction sketch follows below.
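A minimal sketch of the metadata extraction this implies: walk the Hive catalog with SHOW TABLES and DESCRIBE, collecting column names and types as raw material for entities and attributes. This illustrates the idea only – it is not ERwin's actual mechanism – and the connection details are hypothetical.

from pyhive import hive

conn = hive.Connection(host="hadoop-gw", port=10000, username="modeler")
cur = conn.cursor()

cur.execute("SHOW TABLES")
tables = [row[0] for row in cur.fetchall()]

schema = {}
for table in tables:
    # DESCRIBE yields (column_name, data_type, comment) rows;
    # partitioned tables append extra "#"-prefixed section rows, skipped here.
    cur.execute(f"DESCRIBE {table}")
    schema[table] = [(name, dtype) for name, dtype, _ in cur.fetchall()
                     if name and not name.startswith("#")]

for table, cols in schema.items():
    print(table, "->", cols)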
11. CA Internal Proof of Concept: Big Data of CA Enterprise Products
— CA Hadoop test framework with 7 Dell 2950s
— Dump / store logs & data from various CA products (APM, Clarity, Nimsoft, WatchMouse, etc.) into HDFS
— Transform logs & data into structured or semi-structured data stores
— Reverse engineer to build logical models of the different CA products
— Identify potential relationships between data stores (see the matching sketch below)
[Architecture diagram] Logs and data from the CA enterprise products land in the CA Hadoop test framework (HDFS / Cassandra FS) as semi-structured data (JSON, XML). ERwin reverse engineers JSON / XML headers plus CQL (Cassandra SQL), HQL (Hive SQL), and Mongo queries (JSON) against Cassandra / Hive / HBase / mongoDB to produce a unified view of all models.
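A toy sketch of the last two steps – reverse engineering an entity from JSON samples and flagging candidate relationships with a relational schema. All field names, data, and the matching heuristic (shared attribute names) are illustrative assumptions, not the PoC's implementation.

import json

def infer_entity(docs):
    """Union of attribute names and observed types across JSON samples."""
    attrs = {}
    for doc in docs:
        for key, value in doc.items():
            attrs.setdefault(key, set()).add(type(value).__name__)
    return attrs

# Hypothetical monitoring events dumped into HDFS.
samples = [json.loads(line) for line in [
    '{"host_id": 7, "metric": "cpu", "value": 0.93}',
    '{"host_id": 7, "metric": "mem", "value": 0.41, "tags": ["prod"]}',
]]
event_entity = infer_entity(samples)   # e.g. {'host_id': {'int'}, ...}

# Hypothetical relational schema from another product.
relational = {"hosts": ["host_id", "hostname", "datacenter"]}

# Attribute names shared across stores are candidate relationships.
for table, cols in relational.items():
    shared = set(cols) & set(event_entity)
    if shared:
        print(f"candidate link: events <-> {table} via {sorted(shared)}")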
12. What We Have Learned So Far
— Most non-relational data stores will be a simple entity / box in ERwin
− Attributes in each non-relational entity include key indices and columns
− Supercolumns or nested structures can be expanded in the same entity or depicted as a hierarchy (see the sketch below)
— Metadata is important:
− It describes the kind of information / data
− And the structure of the columns in a supercolumn
— There are relationships between non-relational data stores and relational databases
— So far we have only investigated reverse engineering of data stores into a logical model; forward engineering of a logical model into physical non-relational data stores may also be useful
— We are not there yet, but a unified data modeler for relational and Big Data is definitely possible
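To illustrate the two options for nested structures mentioned above, here is a small self-contained sketch: flatten() expands nested fields into dotted attribute paths within one entity, while split() depicts them as a parent / child hierarchy. Purely illustrative.

def flatten(doc, prefix=""):
    """Expand nested fields inside the same entity as dotted paths."""
    attrs = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            attrs.update(flatten(value, prefix=path + "."))
        else:
            attrs[path] = type(value).__name__
    return attrs

def split(doc, name="entity"):
    """Depict nested fields as a hierarchy of parent / child entities."""
    entities = {name: {}}
    for key, value in doc.items():
        if isinstance(value, dict):
            entities.update(split(value, name=f"{name}.{key}"))
        else:
            entities[name][key] = type(value).__name__
    return entities

supercolumn = {"user": {"id": 1, "profile": {"city": "SF", "zip": "94107"}}}
print(flatten(supercolumn))  # {'user.id': 'int', 'user.profile.city': 'str', ...}
print(split(supercolumn))    # {'entity': {}, 'entity.user': {'id': 'int'}, ...}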
13. The Future of Data Modeling
— Presented one (but not the only) direction in which data modeling can evolve to model both relational and non-relational data stores
— The data explosion will continue and accelerate at a much faster rate
— Businesses must rely more and more on collected data to gather the business intelligence needed to compete
— The role of the data architect and modeler will become more important – to analyze Big Data, enterprises must first understand what they have!
14. Thank You – Questions?
William Luk
(650)298-3111
William.luk@ca.com
http://www.linkedin.com/pub/william-luk/1/818/bb1