A unified data modeler in the world of big data
1. A Unified Data Modeler in the World of Big Data
William Luk, CA Technologies Inc
Sr Director, Software Engineering – Data Modeling
Collaboration By Design
2012
Session Code: HT01
2. Speaker Bio
— Senior Director of software development in the Data Management BU; head of ERwin engineering and level 2 support
— Experience in databases, data security, and data management
— BS & MS in CS
3. A Unified Data Modeler in the World of Big Data
Session Agenda
— Where are we & how did we get here?
— Overview of the Big Data world
— Challenges to enterprises and data architects
— Extending data modeling to include Big Data
— Q&A
4. Data Modeling Past 30 Years
— Entity-Relationship (ER) modeling has served us well since the mid-1970s
— Data architects / modelers have used ER tools to ensure data consistency and integrity for very large enterprises
— Ability to integrate new databases from mergers and acquisitions
— A map of where all your data is
— Ability to handle large & complex data models
— Then, the Internet & social networks
5. Internet & Social Networks
— The early Internet used the classic LAMP stack: Linux, Apache Web Server, MySQL Database, and Perl/PHP/Python
— Basic web servers & databases served us well for basic web portals
— Internet growth + social networks changed the scale of databases / data stores
— Traditional relational databases have difficulty handling the scale & the (sub-second) response time required for the web
— Emergence of NoSQL data stores
6. Arrival of Big Data
— Wealth of valuable data to collect:
− User-entered information
− History / logs of user interactions
— This data does not always fit neatly into structured data stores (relational or NoSQL)
— Need to harvest / analyze the data to compete
— Challenges of capturing, storing, searching, analyzing, and visualizing very large and complex data sets
— Large, distributed, analytical platforms (Hadoop) emerged
7. Enterprise Big Data / Hadoop Workflow
[Diagram. Labels: Customer Data Source (HQL (Hive SQL), JSON, XML, …etc; Unstructured Data / Files); HDFS holding Structured Data (Hive, HBase), Semi-structured Data (XML, JSON), and Unstructured Data; Hadoop Framework (Clusters); MapReduce / Analytics (Pig, Cloudera, Datameer, …etc)]
8. Problem of Non-Relational Data Stores
— The performance of NoSQL and unstructured data stores has a price:
− Denormalized data
− Data consistency & integrity: only "eventual" consistency is guaranteed
— Some data (such as user comments) can tolerate these drawbacks
— Some data (such as financial, transactional) cannot
— Enterprise conclusions:
− NoSQL & Big Data are good for business intelligence data
− Financial & transactional data still require relational databases
− Compliance requirements / regulations
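To make the "eventual" consistency trade-off concrete, here is a minimal sketch of two replicas that diverge after a write and converge only after a later synchronization pass. The `Replica` class, the `anti_entropy` function, and the version-number conflict rule (newest version wins) are illustrative assumptions, not the behavior of any particular NoSQL product.

```python
class Replica:
    """One copy of a key-value store; keeps a version number per key."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Assumed conflict rule: keep the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Exchange entries so both replicas end up with the newest version of each key."""
    for key in set(a.data) | set(b.data):
        for src, dst in ((a, b), (b, a)):
            if key in src.data:
                version, value = src.data[key]
                dst.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("comment:42", "first draft", version=1)

stale = r2.read("comment:42")   # the second replica has not seen the write yet
anti_entropy(r1, r2)            # background synchronization
fresh = r2.read("comment:42")   # now both replicas agree
```

A stale read like this is acceptable for a user comment, but not for an account balance, which is the distinction the slide draws.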
9. The New World of Data Modeling with Relational and Big Data
— The new enterprise data landscape:
− Different relational databases
− Distributed Hadoop clusters with structured, semi-structured, and unstructured data, all of which is constantly changing
— Challenges to the data architects / modelers:
− Identify potential relationships between different data stores
− Find an automated way to track and update the unified view
— Data modeling tools, such as ERwin, need to evolve to present a single unified view of ALL enterprise data
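One simple way to surface potential relationships between data stores is to compare attribute names across them. The sketch below is only an assumption about how such a heuristic could look; the store names, the attribute lists, and the `candidate_relationships` function are hypothetical, and a real tool would also need to compare data types and values.

```python
from itertools import combinations


def candidate_relationships(stores):
    """Return sorted (store_a, store_b, shared_attribute) triples.

    Attribute names are compared case-insensitively and without underscores,
    so "customer_id" and "customerId" count as the same attribute.
    """
    def normalize(name):
        return name.lower().replace("_", "")

    found = []
    for (name_a, attrs_a), (name_b, attrs_b) in combinations(sorted(stores.items()), 2):
        norm_b = {normalize(attr) for attr in attrs_b}
        for attr in sorted(attrs_a):
            if normalize(attr) in norm_b:
                found.append((name_a, name_b, normalize(attr)))
    return found


# Hypothetical attribute inventories from three different stores.
stores = {
    "orders_rdbms": ["order_id", "customer_id", "total"],
    "clickstream_hive": ["session_id", "customerId", "url"],
    "comments_json": ["comment_id", "text"],
}
links = candidate_relationships(stores)
```

Each triple is only a candidate; a data architect would still have to confirm that the matching names refer to the same business concept.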
10. ERwin Tapping into Hadoop
[Diagram. Labels: Data Sources (HQL (Hive SQL), JSON, XML, Unstructured Data / Files); JSON / XML Headers; HDFS holding Structured Data (Hive, HBase), Semi-structured Data (XML, JSON), and Unstructured Data; HQL; Hadoop Framework (Clusters); MapReduce / Analytics (Pig, Cloudera, Datameer, …etc)]
11. CA Internal Proof of Concept
— CA Hadoop test framework with 7 Dell 2950's
— Dump / store logs & data from various CA products into HDFS
— Transform logs & data into structured or semi-structured data stores
— Reverse engineer to build logical model of different CA products
— Identify potential relationships between data stores
[Diagram. Labels: Big Data of CA Enterprise Products (APM, Clarity, Nimsoft, WatchMouse, …etc); CQL (Cassandra SQL), HQL (Hive SQL), mongoDB, JSON, XML; CA Hadoop Test Framework; Semi-structured Data (HDFS / Cassandra FS): JSON, XML; Cassandra / Hive / Hbase / mongoDB (JSON); Reverse Engineer via JSON / XML Headers; Reverse Engineer via CQL / HQL / Mongo Query; Unified View of All Models]
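The reverse-engineering step of the proof of concept can be illustrated with a small sketch that infers a logical entity from semi-structured JSON records. This is an assumption about the general idea only, not ERwin's implementation; the `reverse_engineer` function, the entity name `apm_log`, and the sample records are all hypothetical.

```python
import json


def reverse_engineer(entity_name, json_lines):
    """Infer a logical entity from JSON records, one JSON object per line.

    Returns the entity name plus a mapping of attribute name to the set of
    Python type names observed for that attribute across all records.
    """
    attributes = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            attributes.setdefault(key, set()).add(type(value).__name__)
    return {"entity": entity_name, "attributes": attributes}


# Hypothetical log records, as they might be stored in HDFS.
sample_log = [
    '{"host": "web01", "latency_ms": 120, "ok": true}',
    '{"host": "web02", "latency_ms": 95.5, "region": "us-east"}',
]
model = reverse_engineer("apm_log", sample_log)
```

Attributes that appear in only some records (such as `region` here) or with more than one type (such as `latency_ms`) are exactly the cases where the constantly changing nature of semi-structured data shows up in the model.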
12. What We Learned So Far
— Most non-relational data stores will each be a simple entity / box in ERwin
− Attributes in each non-relational entity include key indices and columns
− Supercolumns or nested structures can be expanded in the same entity or depicted as a hierarchy
— Metadata is important:
− Describes the kind of information / data
− Describes the structure of the columns in a supercolumn
— There are relationships between non-relational data stores and relational databases
— So far, we have only investigated reverse engineering of data stores into a logical model. Forward engineering of a logical model into physical non-relational data stores may be useful
— We are not there yet, but a unified data modeler of relational and Big Data is definitely possible
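The two options for nested structures mentioned above (expanding them inside the same entity, or depicting them as a hierarchy) can be sketched as follows. The function names, the dotted naming scheme, and the sample record are assumptions made for illustration; they are not taken from ERwin.

```python
def expand_in_place(record, prefix=""):
    """Option 1: flatten nested dicts into dotted attribute names in one entity."""
    flat = []
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.extend(expand_in_place(value, prefix=name + "."))
        else:
            flat.append(name)
    return flat


def as_hierarchy(entity_name, record):
    """Option 2: give each nested dict its own child entity, named after its parent."""
    entities = {entity_name: []}
    for key, value in record.items():
        if isinstance(value, dict):
            entities.update(as_hierarchy(f"{entity_name}.{key}", value))
        else:
            entities[entity_name].append(key)
    return entities


# Hypothetical record with two levels of nesting.
row = {"user_id": 7, "profile": {"name": "Ada", "address": {"city": "Oslo"}}}
flat_attrs = expand_in_place(row)
tree = as_hierarchy("user", row)
```

Flattening keeps the model to a single box but produces long attribute names, while the hierarchy adds boxes but preserves the nesting explicitly.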
13. The Future of Data Modeling
— We presented one (but not the only) direction in which data modeling can evolve to cover both relational and non-relational data stores
— The data explosion will continue and accelerate
— Businesses must rely more and more on collected data to gather business intelligence to compete
— The role of data architect and modeler will become more important: to analyze Big Data, enterprises must first understand what they have!
14. Thank You – Questions?
William Luk
(650)298-3111
William.luk@ca.com
http://www.linkedin.com/pub/william-luk/1/818/bb1