The document discusses data governance, compliance and security in Hadoop. It provides an agenda for an event on this topic, including presentations from Joe Caserta of Caserta Concepts on data governance in big data, and Patrick Angeles of Cloudera on using Cloudera for data governance in Hadoop. The document also includes background information on Caserta Concepts and their expertise in data warehousing, business intelligence and big data analytics.
2. Agenda
7:00 – Networking (15 min): Grab some food and drink... make some friends.
7:15 – Joe Caserta, President, Caserta Concepts (45 min)
  Welcome + Intro (10 min): About the Meetup, about Caserta Concepts
  Data Governance in Big Data (35 min): Overview of data governance and implementation options in Hadoop
8:00 – Patrick Angeles, Chief Architect, Financial Services, Cloudera (45 min)
  Using Cloudera to ensure data governance in Hadoop; deep dive into Cloudera data governance tools
8:45 – Q&A, More Networking (15 min): Tell us what you’re up to…
3. Joe / Caserta Concepts Timeline
2013 – Established best practices for big data ecosystem implementation: Healthcare, Finance, Insurance
2012 – Launched Big Data Warehousing Meetup in NYC (850+ members); partnered with Big Data vendors Cloudera, Hortonworks, Datameer, more…
2010 – Laser focus on extending Data Warehouses with Big Data solutions
2009 – Formalized alliances/partnerships with System Integrators; launched Big Data practice
2004 – Launched Training practice, teaching data concepts world-wide; co-authored, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
2001 – Founded Caserta Concepts in NYC; web log analytics solution published in Intelligent Enterprise
1996 – Began consulting on database programming and data modeling; dedicated to Data Warehousing and Business Intelligence since 1996
1986 – 25+ years of hands-on experience building database solutions
4. Caserta Concepts
Technology services company with focused expertise in:
Data Warehousing
Business Intelligence
Big Data Analytics
Data is all we do.
Established in 2001:
Industry-recognized workforce
Consulting, Education, Implementation
Broad experience across industries:
Healthcare / Financial Services / Insurance
Manufacturing / Higher Education / eCommerce
5. Implementation Expertise & Offerings
Strategic Roadmap/
Assessment/Consulting
Big Data
Analytics
Storm
Database
BI/Visualization/
Analytics
Master Data Management
8. Caserta Concepts
Listed as one of the Top 20 Most Promising Data Analytics Consulting Companies.
CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges.
A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial board of CIOReview selected the final 20.
9. Help Wanted
Does this word cloud excite you?
Cassandra
Big Data Architect
Storm
HBase
Speak with us about our open positions: leslie@casertaconcepts.com
10. Innovation is the only sustainable
competitive advantage a company can have.
11. About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts – DW, BI & Big Data Analytics Consulting
• Next BDW Meetup: March 25, 2014
• New Work City, Broadway & Canal
• Topic TBD – suggestions?
12. Why Big Data?
[Diagram: two architectures side by side]
Traditional BI: source systems (Enrollments, Claims, Finance, Others…) feed ETL into a traditional EDW, which serves ad-hoc and canned reporting.
Big Data Analytics: the same sources feed ETL into a Big Data cluster – HDFS spread across nodes (N1–N5), with MapReduce, Pig/Hive, and Mahout on top – serving data science, NoSQL databases, ad-hoc query, and canned reporting.
A horizontally scalable environment, optimized for analytics.
13. Quick Vocabulary Lesson
Hadoop Distributions: Cloudera, MapR, Hortonworks, Pivotal HD
Tools:
Whirr: Used to launch/kill computing clusters
Kafka: Publish-subscribe messaging system
Mahout: Machine learning library
Hive: Maps data to structures and provides SQL-like queries (see the sketch after this list)
Pig: Data transformation language for big data, from Yahoo
Sqoop: Extracts from external sources and loads into Hadoop
Zookeeper: Used to manage & administer Hadoop
Storm: Real-time ETL
NoSQL:
Document: MongoDB, CouchDB
Graph: Neo4j, Titan
Key-Value: Riak, Redis
Columnar: Cassandra, HBase
Languages: Python, SciPy, Java
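To make the Hive entry concrete, here is a minimal sketch, assuming a local HiveServer2 endpoint and a hypothetical web_logs table, that queries Hadoop data with SQL-like syntax from Python via the PyHive client:

```python
# Minimal PyHive sketch: query a (hypothetical) web_logs table in Hadoop
# through HiveServer2. Host, port, and the table name are assumptions.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Hive maps files in HDFS to table structures, so plain SQL works here.
cursor.execute("SELECT status_code, COUNT(*) AS hits "
               "FROM web_logs GROUP BY status_code")
for status_code, hits in cursor.fetchall():
    print(status_code, hits)

cursor.close()
conn.close()
```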
14. The Challenges With Big Data
Volume
• Data volume is higher, so the process must rely more on programmatic administration
• Less people/process dependence
Veracity
• Dealing with sparse, incomplete, volatile, and highly manufactured data – how do you certify sentiment analysis?
Variety
• A wider breadth of datasets and sources in scope requires larger data governance support
• Data governance cannot start at the warehouse
Velocity
• Data is coming in so fast – how do we monitor it? (see the sketch below)
• Real real-time analytics
• What does “complete” mean?
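As a small illustration of the velocity problem, a sketch, assuming a Kafka topic named "events" and the kafka-python client, that counts arriving messages per interval so completeness can at least be watched in near real time:

```python
# Sketch: watch arrival rate on a Kafka topic so fast-moving data is at
# least monitored. Topic name and broker address are assumptions.
import time
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")

window_start, count = time.time(), 0
for message in consumer:          # runs until interrupted
    count += 1
    elapsed = time.time() - window_start
    if elapsed >= 60:             # one-minute window
        print(f"{count} events in {elapsed:.0f}s")  # feed this to alerting
        window_start, count = time.time(), 0
```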
15. Why is Big Data Governance Important?
Convergence of data quality, management, and policies
All data in an organization?
A set of processes that ensures important data assets are formally managed throughout the enterprise:
Ensures data can be trusted
Makes people accountable for low data quality
It is about putting people and technology in place to fix and prevent issues with data so that the enterprise can become more efficient.
16. The Components of Data Governance
Organization: The ‘people’ part – establishing an Enterprise Data Council, Data Stewards, etc.
Metadata: Definitions, lineage (where does this data come from?), business definitions, technical metadata
Privacy/Security: Identify and control sensitive data; regulatory compliance
Data Quality and Monitoring: Data must be complete and correct – measure, improve, certify
Business Process Integration: Policies around data frequency, source availability, etc.
Master Data Management: Ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
17. What’s Old is New Again
Before Data Warehousing Data Governance:
Users trying to produce reports from raw source data
No data conformance
No Master Data Management
No data quality processes
No trust: two analysts were almost guaranteed to come up with two different sets of numbers!
Before Big Data Governance:
We can put “anything” in Hadoop
We can analyze anything
We’re scientists, we don’t need IT, we make the rules
Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess
Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
18. Making it Right
The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways:
New tools
External data
Data blending
Decentralization
With all the V’s – data scientists, new tools, new data – we must rely LESS on HUMANS
We need more systemic administration
We need systems and tools to help with big data governance
This space is EXTREMELY immature!
Steps towards Big Data Governance:
1. Establish the difference between traditional data governance and big data governance
2. Establish basic rules for where new data governance can be applied
3. Establish processes for graduating the products of data science to governance
4. Establish a set of tools to make governing Big Data feasible
19. Preventing a Data Swamp with Governance
Org and Process
• Add Big Data to the overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
Master Data Management
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
Data Quality and Monitoring
• Data Quality and Monitoring (probably home grown – Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce (see the mapper sketch after this list)
• Acting on large dataset quality checks may require distribution
Metadata
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home-grown tables
Information Lifecycle
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
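To ground the point that quality checks need not be SQL, a minimal sketch of a Hadoop Streaming mapper in Python – the field layout and null markers are assumptions – that emits completeness counters per column, which a simple summing reducer can aggregate across the cluster:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper for a distributed completeness check.
# Assumes tab-delimited records; column names and null markers are made up.
import sys

COLUMNS = ["member_id", "claim_date", "amount"]   # hypothetical layout
NULLS = {"", "NULL", "\\N"}

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    for name, value in zip(COLUMNS, fields):
        # Emit (column, missing-flag); a summing reducer turns these into
        # per-column completeness ratios over the full dataset.
        flag = 1 if value in NULLS else 0
        print(f"{name}\t{flag}")
```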
20. The Big Data Governance Pyramid
Hadoop has different governance demands at each tier. Only the top tier of the pyramid is fully governed. We refer to this as the Trusted tier of the Big Data Warehouse.
4. Big Data Warehouse – Fully Data Governed (trusted): user community, arbitrary queries and reporting
3. Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, development of to-be BDW facts. Governance: Metadata Catalog; ILM (who has access, how long do we “manage it”); Data Quality and Monitoring (monitoring of completeness of data)
2. Data Lake – Integrated Sandbox: data is ready to be turned into information – organized, well defined, complete. Governance: Metadata Catalog; ILM; Data Quality and Monitoring
1. Landing Area – Source Data in “Full Fidelity”: raw machine data collection, collect everything. Governance: Metadata Catalog; ILM
21. Big Data Governance Realities
Full data governance can only be applied to “structured” data:
The data must have a known and well-documented schema
This can include materialized endpoints such as files or tables, OR projections such as a Hive table
Governed structured data must have:
A known schema with metadata
A known and certified lineage
A monitored, quality-tested, managed process for ingestion and transformation
A governed usage – data isn’t just for enterprise BI tools anymore
We talk about unstructured data in Hadoop, but more often it’s semi-structured/structured with a definable schema. Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
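As a toy illustration of “structure must be extracted before analysis”, a sketch – the log format and field names are assumptions – that projects a schema onto raw web-server lines with a regular expression:

```python
# Sketch: extract structure from "unstructured" log lines before analysis.
# The log layout and field names here are illustrative assumptions.
import re

LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse(line):
    """Project a schema onto one raw line; None means it failed the schema."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

sample = '10.0.0.1 - - [25/Mar/2014:19:00:00 -0400] "GET /index HTTP/1.1" 200 512'
print(parse(sample))  # {'ip': '10.0.0.1', ..., 'status': '200', 'bytes': '512'}
```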
22. The Data Scientists Can Help!
Provide requirements for the Data Lake:
Proper metadata established: catalog, data definitions, lineage
Quality monitoring: know and validate data completeness
Data Science to Big Data Warehouse mapping
Full Data Governance requirements:
Provide full process lineage
Data certification process by data stewards and business owners
Ongoing data quality monitoring that includes quality checks
23. What does a Data Scientist Do, Anyway?
NOT just writing really cool and sophisticated algorithms that impact the way the business runs.
Much of the time of a Data Scientist is spent:
Searching for the data they need
Making sense of the data
Figuring out why the data looks the way it does and assessing its validity
Cleaning up all the garbage within the data so it represents the true business
Combining events with reference data to give it context
Correlating event data with other events
Finally, they write algorithms to perform mining, clustering, and predictive analytics – the sexy stuff.
24. The Non-Data Part of Big Data
Caution: Some Assembly Required
The V’s require robust tooling. Unfortunately the toolset is pretty thin: some of the most hopeful tools are brand new or in incubation!
Components like ILM have fair tooling; others like MDM and Data Quality are sparse.
People, processes, and business commitment are still critical!
Apache Falcon (incubating) promises many of the features we need, but is fairly immature (version 0.3).
Recommendation: Roll your own custom lifecycle management workflow using Oozie + retention metadata
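A minimal sketch of what such a home-grown ILM step might look like: a Python action an Oozie workflow could invoke to purge datasets past their retention window via the standard `hdfs dfs` CLI. The retention metadata file, its format, and the partition layout are all assumptions:

```python
# Sketch of an ILM purge step an Oozie workflow might call.
# Retention metadata format and HDFS paths are illustrative assumptions,
# e.g. {"datasets": [{"path": "/data/landing/claims", "retention_days": 90}]}
import json
import subprocess
from datetime import datetime, timedelta

with open("retention_metadata.json") as f:
    policy = json.load(f)

for dataset in policy["datasets"]:
    cutoff = datetime.now() - timedelta(days=dataset["retention_days"])
    # Assumes daily partitions named dt=YYYY-MM-DD under each dataset path.
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", dataset["path"]],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in listing.splitlines():
        if "dt=" not in line:
            continue
        partition = line.split()[-1]                      # full HDFS path
        day = datetime.strptime(partition.split("dt=")[-1], "%Y-%m-%d")
        if day < cutoff:
            subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash",
                            partition], check=True)
```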
25. Master Data Management
Traditional MDM will do, depending on your data size and requirements:
Relational is awkward – extreme normalization, poor usability and performance
NoSQL stores like HBase have benefits:
If you need super-high-performance, low-millisecond response times to incorporate into your Big Data ETL
Flexible schema
A graph database is a near-perfect fit – relationships and graph analysis bring master data to life!
Data quality and matching processes are required
Little to no community or vendor support
More will come with YARN (more commercial and open-source IP will be leverageable in the Hadoop framework)
Recommendation: Buy + Enhance, or Build
26. Master Data Management Components
[Diagram: MDM hub layers – User Interface, Security, Services, Rules, Workflow, Data – spanning domains such as Customers, Vendors, Products, Employees, Transactions?]
Consistent policy enforcement and security
Integration with existing ecosystem
Data governance through workflow management
Data quality enforcement through metadata-driven rules
Time-variant hierarchies and attributes
High-performance, flexible, scalable database – think Graph!
27. Mastering Data
Staging Library (after Validation):
Source | ID  | Name          | Home Address    | Birth Date | SSN
SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789
SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789
SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL
Consolidated Library (after Standardization and Matching):
Source | ID  | Std Name        | Std Addr        | Birth Date | SSN         | MDM ID
SYS A  | 123 | James Stagnitto | 123 Main Street | 8/20/1959  | 123-45-6789 | 1
SYS B  | ABC | James Stagnitto | 132 Main Street | 8/20/1959  | 123-45-6789 | 1
SYS C  | XYZ | James Stag      | NULL            | 8/20/1959  | NULL        | 1
Integrated Library (after Survivorship):
MDM ID | Name            | Home Address    | Birth Date | SSN
1      | James Stagnitto | 123 Main Street | 8/20/1959  | 123-45-6789
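A toy sketch of the standardize → match → survive flow shown above. The alias map, matching rule, and survivorship rule are simplified illustrative assumptions, not a production matcher:

```python
# Toy sketch of the mastering flow above: standardize, match, then survive
# the most complete values into one golden record.
import re

records = [
    {"source": "SYS A", "name": "Jim Stagnitto", "addr": "123 Main St",
     "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS B", "name": "J. Stagnitto", "addr": "132 Main Street",
     "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS C", "name": "James Stag", "addr": None,
     "dob": "8/20/1959", "ssn": None},
]

GIVEN_NAME_ALIASES = {"Jim": "James", "J.": "James"}   # toy standardization

def standardize(r):
    first, _, rest = r["name"].partition(" ")
    r["std_name"] = f"{GIVEN_NAME_ALIASES.get(first, first)} {rest}"
    r["std_addr"] = re.sub(r"\bSt\b", "Street", r["addr"]) if r["addr"] else None
    return r

# Matching: toy rule keys on birth date; real matchers blend fuzzy
# name/address similarity with strong identifiers like SSN.
groups = {}
for r in map(standardize, records):
    groups.setdefault(r["dob"], []).append(r)

# Survivorship: keep the longest non-null value per field (a stand-in
# for real precedence rules like "most trusted source wins").
def survive(group, field):
    values = [r[field] for r in group if r[field]]
    return max(values, key=len) if values else None

for mdm_id, group in enumerate(groups.values(), start=1):
    golden = {f: survive(group, f)
              for f in ("std_name", "std_addr", "dob", "ssn")}
    print(mdm_id, golden)   # -> 1 {'std_name': 'James Stagnitto', ...}
```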
29. Graph Databases (NoSQL) to the Rescue
Hierarchical relationships are never rigid
Relational models with tables and columns are not flexible enough
Neo4j is the leading graph database
Many MDM systems are going graph:
Pitney Bowes – Spectrum MDM
Reltio – Worry-Free Data for Life Sciences
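As a flavor of how master data looks in a graph, a minimal sketch using the official neo4j Python driver. The Cypher model – Golden and SourceRecord nodes, the MASTERED_AS relationship – is an illustrative assumption:

```python
# Sketch: master data as a graph via the official neo4j Python driver.
# Node labels and relationship names are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # MERGE is idempotent: re-running the load never duplicates the master.
    session.run(
        """
        MERGE (g:Golden {mdm_id: $mdm_id})
          SET g.name = $name
        MERGE (s:SourceRecord {source: $source, source_id: $source_id})
        MERGE (s)-[:MASTERED_AS]->(g)
        """,
        mdm_id=1, name="James Stagnitto", source="SYS A", source_id="123",
    )

driver.close()
```

Relationship queries (households, provider hierarchies, influence networks) then become short Cypher traversals instead of multi-way relational joins, which is the flexibility the slide is pointing at.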
30. Big Data Security
Determining who sees what:
Need to be able to secure as many data types as possible
Auto-discovery is important!
Current products:
Sentry – SQL security semantics for Hive
Knox – Central authentication mechanism for Hadoop
Cloudera Navigator – Central security auditing
Hadoop – Good old *NIX permissions with LDAP
Dataguise – Auto-discovery, masking, encryption
Datameer – The BI tool for Hadoop
Recommendation: Assemble based on existing tools
31. Metadata
• For now, Hive Metastore + HCatalog + custom might be best
• HCatalog gives great “abstraction” services:
• Maps to a relational schema
• Developers don’t need to worry about data formats and storage
• Can use SuperLuminate to get started
Recommendation: Leverage HCatalog + custom metadata tables
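One way such custom metadata tables might look: a small sketch recording lineage and stewardship alongside what HCatalog already knows. The fields and table layout are assumptions, and sqlite3 merely stands in for wherever the registry would actually live (a Hive table, HBase, etc.):

```python
# Sketch: a tiny custom metadata registry to sit beside the Hive
# Metastore/HCatalog. Fields are illustrative assumptions.
import sqlite3

conn = sqlite3.connect("governance_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dataset_metadata (
        dataset    TEXT PRIMARY KEY,   -- Hive/HCatalog table name
        steward    TEXT,               -- accountable human
        lineage    TEXT,               -- upstream sources / process
        certified  INTEGER DEFAULT 0   -- has a steward signed off?
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO dataset_metadata VALUES (?, ?, ?, ?)",
    ("default.web_logs", "jane.steward@example.com",
     "flume -> /data/landing/web_logs -> pig cleanse", 1),
)
conn.commit()

for row in conn.execute("SELECT * FROM dataset_metadata"):
    print(row)
conn.close()
```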
32. The Twitter Way
Twitter was suffering from a data science wild west.
They developed their own enterprise Data Access Layer (DAL) and gave developers and data scientists a reason to use it:
• Easy-to-use storage handlers
• Automatic partitioning
• Schema backwards compatibility
• Monitoring and dependency checks
33. Data Quality and Monitoring
To TRUST your information, a robust set of tools for continuous monitoring is needed.
Accuracy and completeness of data must be ensured.
Any piece of information in the Big Data Warehouse must have monitoring:
Basic stats: source-to-target counts
Error events: did we trap any errors during processing?
Business checks: is the metric “within expectations”? How does it compare with an abridged alternate calculation?
There is a large gap in commercial / open-source project offerings.
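A minimal sketch of what a source-to-target count check plus a business check could look like. The thresholds, example numbers, and where the counts come from are assumptions:

```python
# Sketch: two of the checks above as plain functions. Tolerances and the
# example counts are illustrative assumptions.

def check_counts(source_count, target_count, tolerance=0.0):
    """Basic stat: source-to-target row counts must agree (within tolerance)."""
    drift = abs(source_count - target_count) / max(source_count, 1)
    return drift <= tolerance

def check_business_metric(metric, abridged_estimate, max_deviation=0.05):
    """Business check: full metric vs. a cheap alternate calculation."""
    deviation = abs(metric - abridged_estimate) / max(abs(abridged_estimate), 1e-9)
    return deviation <= max_deviation

# e.g. counts pulled from the source extract and the loaded Hive table:
assert check_counts(source_count=1_000_000, target_count=1_000_000)
assert check_business_metric(metric=42_500.0, abridged_estimate=41_900.0)
print("all quality checks passed")
```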
34. Data Quality and Monitoring Recommendation
• BUILD a robust data quality subsystem:
• HBase for metadata and error event facts
• Oozie for orchestration
• Based on The Data Warehouse ETL Toolkit
[Diagram: DQ Engine – a Quality Check Builder driving Hive, Pig, and MapReduce checks, reading DQ metadata, emitting DQ events and time-series facts to a DQ notifier and logger]
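A sketch of the “HBase for error event facts” idea using the happybase client. The table name, column family, and row-key scheme are assumptions (the table would be created once with a `dq` column family):

```python
# Sketch: record a data-quality error event in HBase via happybase.
# Table name, column family, and row-key layout are illustrative assumptions.
import time
import happybase

connection = happybase.Connection("localhost")  # HBase Thrift gateway assumed
events = connection.table("dq_events")

# Row key: check id + epoch timestamp keeps each check's events in time order.
row_key = f"row_count_check|{int(time.time())}".encode()
events.put(row_key, {
    b"dq:dataset": b"default.web_logs",
    b"dq:check":   b"source_to_target_counts",
    b"dq:status":  b"FAILED",
    b"dq:detail":  b"source=1000000 target=998734",
})
connection.close()
```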
35. Closing Thoughts – Enable the Future
Big Data requires the convergence of data quality, data management, data engineering, and business policies.
Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.
Get experts to help calm the turbulence… it can be exhausting!
Blaze new trails!
Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.”
-- Martin Fowler
We focused our attention on building a single version of the truth. We mainly applied data governance to the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to GOVERN how the data would be used.