Big Data Warehousing
February 10, 2014

Today’s Topic: Data Governance,
Compliance and Security in Hadoop - with
Cloudera

Sponsored By:
Agenda
7:00

Networking (15 min)
Grab some food and drink... Make some friends.

7:15

Joe Caserta (45 min)
President, Caserta Concepts

Welcome + Intro (10 min)
About the Meetup, about Caserta Concepts

Data Governance in Big Data (35 min)
Overview of Data Governance and implementation options in Hadoop.

8:00

Patrick Angeles (45 min)
Chief Architect, Financial Services, Cloudera

Using Cloudera to ensure data governance in Hadoop
Deep dive into Cloudera Data Governance Tools.

8:45

Q&A, More Networking (15 min)
Tell us what you’re up to…
Joe / Caserta Concepts Timeline
Established best practices for big
data ecosystem implementation –
Healthcare, Finance, Insurance

2013

Launched Big Data Warehousing
Meetup in NYC – 850+ Members

2012
Partnered with Big Data vendors
Cloudera, Hortonworks, Datameer, more…

2010
2009

Laser focus on extending Data
Warehouses with Big Data solutions
Formalized Alliances / Partnerships –
System Integrators

Launched Big Data practice
2004
Launched Training practice, teaching
data concepts world-wide

2001

Founded Caserta Concepts in NYC

Co-author, with Ralph Kimball, The
Data Warehouse ETL Toolkit (Wiley)

Web log analytics solution published
in Intelligent Enterprise

1996
Began consulting in database
programming and data modeling

Dedicated to Data
Warehousing, Business Intelligence
since 1996

1986

25+ years hands-on experience
building database solutions
Caserta Concepts
• Technology services company with focused expertise in:
  • Data Warehousing
  • Business Intelligence
  • Big Data Analytics

Data is all we do.
• Established in 2001:
  • Industry-recognized workforce
  • Consulting, Education, Implementation
• Broad experience across industries:
  • Healthcare / Financial Services / Insurance
  • Manufacturing / Higher Education / eCommerce
Implementation Expertise & Offerings
Strategic Roadmap/
Assessment/Consulting

Big Data
Analytics

Storm
Database

BI/Visualization/
Analytics

Master Data Management
Client Portfolio
Finance, Healthcare
& Insurance

Retail/eCommerce
& Manufacturing

Education
& Services
Caserta Partners
Hadoop Distributions

Platforms/ETL

Analytics & BI
Caserta Concepts
Listed as a Top 20 Most Promising
Data Analytics Consulting Companies

CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones
who are at the forefront of tackling the real analytics challenges.
A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the editorial board of
CIOReview selected the Final 20.
Help Wanted
Does this word cloud excite you?

Cassandra

Big Data Architect

Storm

HBase

Speak with us about our open positions: leslie@casertaconcepts.com
Innovation is the only sustainable
competitive advantage a company can have.
About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts, DW, BI & Big Data Analytics Consulting
• Next BDW Meetup: March 25, 2014
• New Work City, Broadway & Canal
• Topic TBD, Suggestions?
Why Big Data?
[Diagram: source systems (Enrollments, Claims, Finance, Others…) feed two paths. Traditional BI: ETL into a Traditional EDW, serving Ad-Hoc/Canned Reporting. Big Data Analytics: ETL into a Big Data Cluster (nodes N1 to N5) running Mahout, MapReduce and Pig/Hive on the Hadoop Distributed File System (HDFS), alongside NoSQL Databases, serving Data Science, Ad-Hoc Query and Canned Reporting. Caption: Horizontally Scalable Environment - Optimized for Analytics]
Quick Vocabulary Lesson
Hadoop Distributions: Cloudera, MapR, Hortonworks, Pivotal HD
Tools:
• Whirr: Used to launch/kill computing clusters
• Kafka: Publish-subscribe messaging system
• Mahout: Machine learning
• Hive: Map data to structures and use SQL-like queries
• Pig: Data transformation language for big data, from Yahoo
• Sqoop: Extracts data from external sources and loads it into Hadoop
• ZooKeeper: Coordination service used to manage & administer Hadoop
• Storm: Real-time ETL

NoSQL:
• Document: MongoDB, CouchDB
• Graph: Neo4j, Titan
• Key Value: Riak, Redis
• Columnar: Cassandra, HBase

Languages: Python, SciPy, Java
The Challenges With Big Data
Volume
• Data volume is higher, so the process must be more reliant on programmatic administration
• Less people/process dependence

Veracity
• Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis?

Variety
• Wider breadth of datasets and sources in scope requires larger data governance support
• Data governance cannot start at the warehouse

Velocity
• Data is coming in so fast, how do we monitor it?
• Real real-time analytics
• What does “complete” mean?
Why is Big Data Governance Important?
• Convergence of
  • Data quality
  • Management and policies
  • All data in an organization?
• Set of processes
  • Ensures important data assets are formally managed throughout the enterprise.
  • Ensures data can be trusted
  • People made accountable for low data quality

It is about putting people and technology in place to fix and
prevent issues with data so that the enterprise can become
more efficient.
The Components of Data Governance
Organization
• This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.

Metadata
• Definitions, lineage (where does this data come from), business definitions, technical metadata

Privacy/Security
• Identify and control sensitive data, regulatory compliance

Data Quality and Monitoring
• Data must be complete and correct. Measure, improve, certify

Business Process Integration
• Policies around data frequency, source availability, etc.

Master Data Management
• Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.

Information Lifecycle Management (ILM)
• Data retention, purge schedule, storage/archiving
What’s Old is New Again
• Before Data Warehousing Data Governance
  • Users trying to produce reports from raw source data
  • No Data Conformance
  • No Master Data Management
  • No Data Quality processes
  • No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers!
• Before Big Data Governance
  • We can put “anything” in Hadoop
  • We can analyze anything
  • We’re scientists, we don’t need IT, we make the rules
• Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess
• Rule #2: Information harvested from ungoverned systems will take us back to the old days: No Trust = Not Actionable
Making it Right
• The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways
  • New tools
  • External data
  • Data blending
  • Decentralization
• With all the V’s, data scientists, new tools, new data we must rely LESS on HUMANS
  • We need more systemic administration
  • We need systems, tools to help with big data governance
  • This space is EXTREMELY immature!
• Steps towards Big Data Governance
  1. Establish difference between traditional data and big data governance
  2. Establish basic rules for where new data governance can be applied
  3. Establish processes for graduating the products of data science to governance
  4. Establish a set of tools to make governing Big Data feasible
Preventing a Data Swamp with Governance
Org and Process
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)

Master Data Management
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms

Data Quality and Monitoring
• Data Quality and Monitoring (probably home grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution

Metadata
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables

Information Lifecycle
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
The Big Data Governance Pyramid
• Hadoop has different governance demands at each tier.
• Only the top tier of the pyramid is fully governed.
• We refer to this as the Trusted tier of the Big Data Warehouse.

Tier 4: Big Data Warehouse
  User community arbitrary queries and reporting
  Fully Data Governed (trusted)

Tier 3: Data Science Workspace
  Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts
  Metadata → Catalog
  ILM → who has access, how long do we “manage it”
  Data Quality and Monitoring → Monitoring of completeness of data

Tier 2: Data Lake – Integrated Sandbox
  Data is ready to be turned into information: organized, well defined, complete.
  Metadata → Catalog
  ILM → who has access, how long to “manage it”
  Data Quality and Monitoring → Monitoring of completeness of data

Tier 1: Landing Area – Source Data in “Full Fidelity”
  Raw machine data collection, collect everything
  Metadata → Catalog
  ILM → who has access, how long do we “manage it”
Big Data Governance Realities
• Full data governance can only be applied to “Structured” data
  • The data must have a known and well documented schema
  • This can include materialized endpoints such as files or tables OR projections such as a Hive table
• Governed structured data must have:
  • A known schema with Metadata
  • A known and certified lineage
  • A monitored, quality-tested, managed process for ingestion and transformation
  • A governed usage → Data isn’t just for enterprise BI tools anymore
• We talk about unstructured data in Hadoop, but more often it is semi-structured or structured with a definable schema.
• Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
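The four requirements for governed structured data can be expressed as a simple checklist. The following is only a sketch under assumptions: the `DatasetRecord` fields and the `governance_gaps` function are hypothetical names invented for illustration, not part of Hadoop, Hive, or any Cloudera tool.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; field names are illustrative only."""
    name: str
    schema: dict = field(default_factory=dict)       # column name -> type
    lineage: list = field(default_factory=list)      # ordered upstream sources
    lineage_certified: bool = False
    ingestion_monitored: bool = False
    approved_uses: list = field(default_factory=list)


def governance_gaps(ds: DatasetRecord) -> list:
    """Return the governance requirements this dataset does not yet meet."""
    gaps = []
    if not ds.schema:
        gaps.append("known schema with metadata")
    if not (ds.lineage and ds.lineage_certified):
        gaps.append("known and certified lineage")
    if not ds.ingestion_monitored:
        gaps.append("monitored, quality-tested ingestion and transformation")
    if not ds.approved_uses:
        gaps.append("governed usage")
    return gaps


claims = DatasetRecord(
    name="claims_fact",
    schema={"claim_id": "string", "amount": "decimal(12,2)"},
    lineage=["landing/claims_raw", "lake/claims_clean"],
    lineage_certified=True,
    ingestion_monitored=True,
    approved_uses=["enterprise BI", "data science"],
)
tweets = DatasetRecord(name="tweets_raw")

print(governance_gaps(claims))  # nothing missing: candidate for the trusted tier
print(governance_gaps(tweets))  # all four requirements still open
```

A dataset with an empty gap list is a candidate for the fully governed tier; anything else stays in a lower tier until the gaps are closed.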
The Data Scientists Can Help!
• Provide requirements for Data Lake
  • Proper metadata established:
    • Catalog
    • Data Definitions
    • Lineage
    • Quality monitoring
  • Know and validate data completeness
• Data Science to Big Data Warehouse mapping
• Full Data Governance Requirements
  • Provide full process lineage
  • Data certification process by data stewards and business owners
  • Ongoing Data Quality monitoring that includes Quality Checks
What does a Data Scientist Do, Anyway?
• Writes really cool and sophisticated algorithms that impact the way the business runs.
• NOT
• Much of the time of a Data Scientist is spent:
  • Searching for the data they need
  • Making sense of the data
  • Figuring out why the data looks the way it does and assessing its validity
  • Cleaning up all the garbage within the data so it represents true business
  • Combining events with Reference data to give it context
  • Correlating event data with other events
• Finally, they write algorithms to perform mining, clustering and predictive analytics – the sexy stuff.
The Non-Data Part of Big Data
Caution: Some Assembly Required
The V’s require robust tooling:
• Unfortunately the toolset is pretty thin: Some of the most hopeful tools are brand new or in incubation!
• Components like ILM have fair tooling, others like MDM and Data Quality are sparse
People, Processes and Business commitment are still critical!

• Apache Falcon (Incubating) promises many of the features we need, but it is fairly immature (Version 0.3).

Recommendation: Roll your own custom lifecycle management
workflow using Oozie + retention metadata
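One way to picture "retention metadata" is a small table of rules that a scheduled job evaluates against the partitions it finds. The sketch below assumes a hypothetical rule layout (`path`, `retain_days`, `action`) and date-partitioned directories; none of these names come from Oozie or Falcon, and the actual archive or purge step is left out.

```python
from datetime import date, timedelta

# Hypothetical retention metadata: one rule per dataset path.
RETENTION = [
    {"path": "/landing/weblogs", "retain_days": 90, "action": "archive"},
    {"path": "/lake/claims", "retain_days": 2555, "action": "purge"},
]


def expired_partitions(partitions, today, retention=RETENTION):
    """Yield (path, partition_date, action) for partitions past retention.

    `partitions` maps a dataset path to the list of partition dates present.
    """
    for rule in retention:
        cutoff = today - timedelta(days=rule["retain_days"])
        for part_date in sorted(partitions.get(rule["path"], [])):
            if part_date < cutoff:
                yield rule["path"], part_date, rule["action"]


partitions = {
    "/landing/weblogs": [date(2013, 10, 1), date(2014, 2, 1)],
    "/lake/claims": [date(2010, 1, 1)],
}
for path, part_date, action in expired_partitions(partitions, date(2014, 2, 10)):
    print(f"{action}: {path}/dt={part_date.isoformat()}")
```

In a setup like the one recommended above, a scheduled workflow could run such a script daily and hand the resulting list to the step that actually archives or deletes the data.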
Master Data Management
• Traditional MDM will do depending on your data size and requirements:
  • Relational is awkward, extreme normalization, poor usability and performance
• NoSQL stores like HBase have benefits
  • If you need super high performance, low-millisecond response times to incorporate into your Big Data ETL
  • Flexible Schema
• Graph database is a near-perfect fit. Relationships and graph analysis bring master data to life!
• Data quality and matching processes are required
• Little to no community or vendor support
• More will come with YARN (more Commercial and Open Source IP will be leverageable in the Hadoop framework)
Recommendation: Buy + Enhance or Build.
Master Data Management Components

[Diagram: MDM components (User Interface, Security, Services, Rules, Workflow) surrounding the master Data domains: Customers, Vendors, Products, Employees, Transactions?]

• Consistent Policy Enforcement and Security
• Integration with existing ecosystem
• Data Governance through Workflow Management
• Data Quality enforcement through metadata-driven rules
• Time-Variant Hierarchies and attributes
• High Performance, Flexible, Scalable Database – Think Graph!
Mastering Data
Staging Library (Validation)

Source  ID   Name           Home Address     Birth Date  SSN
SYS A   123  Jim Stagnitto  123 Main St      8/20/1959   123-45-6789
SYS B   ABC  J. Stagnitto   132 Main Street  8/20/1959   123-45-6789
SYS C   XYZ  James Stag     NULL             8/20/1959   NULL

Consolidated Library (Standardization, Matching)

Source  ID   Name           Home Address     Birth Date  SSN          Std Name         Std Addr         MDM ID
SYS A   123  Jim Stagnitto  123 Main St      8/20/1959   123-45-6789  James Stagnitto  123 Main Street  1
SYS B   ABC  J. Stagnitto   132 Main Street  8/20/1959   123-45-6789  James Stagnitto  132 Main Street  1
SYS C   XYZ  James Stag     NULL             8/20/1959   NULL         James Stag       NULL             1

Integrated Library (Survivorship)

MDM ID  Name             Home Address     Birth Date  SSN
1       James Stagnitto  123 Main Street  8/20/1959   123-45-6789
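The standardization, matching and survivorship stages can be sketched on the three sample records. The slide does not state the actual rules, so everything below is an assumption made for illustration: the lookup tables, the toy match rule (same birth date, plus same SSN or a shared surname prefix), and the survivorship rule (most frequent non-null value, ties going to the earliest source).

```python
from collections import Counter

# Staging rows from the slide; None stands in for NULL.
STAGING = [
    {"source": "SYS A", "id": "123", "name": "Jim Stagnitto",
     "addr": "123 Main St", "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS B", "id": "ABC", "name": "J. Stagnitto",
     "addr": "132 Main Street", "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS C", "id": "XYZ", "name": "James Stag",
     "addr": None, "dob": "8/20/1959", "ssn": None},
]

# Toy lookup tables (assumptions); real standardization uses reference data.
FIRST_NAMES = {"Jim": "James", "J.": "James"}
STREET_WORDS = {"St": "Street"}


def standardize(row):
    """Add std_name and std_addr columns to a copy of the row."""
    out = dict(row)
    first, _, rest = row["name"].partition(" ")
    out["std_name"] = f"{FIRST_NAMES.get(first, first)} {rest}"
    if row["addr"] is None:
        out["std_addr"] = None
    else:
        out["std_addr"] = " ".join(
            STREET_WORDS.get(word, word) for word in row["addr"].split()
        )
    return out


def same_person(a, b):
    """Toy match rule: same birth date, and same SSN or shared surname prefix."""
    if a["dob"] != b["dob"]:
        return False
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return True
    last_a, last_b = a["std_name"].split()[-1], b["std_name"].split()[-1]
    return last_a.startswith(last_b) or last_b.startswith(last_a)


def assign_mdm_ids(rows):
    """Greedy matching: reuse the ID of the first earlier row that matches."""
    next_id = 1
    for i, row in enumerate(rows):
        match = next((r for r in rows[:i] if same_person(r, row)), None)
        if match:
            row["mdm_id"] = match["mdm_id"]
        else:
            row["mdm_id"] = next_id
            next_id += 1
    return rows


def survive(rows, fields=("std_name", "std_addr", "dob", "ssn")):
    """Per field, keep the most frequent non-null value (ties: earliest source)."""
    golden = {"mdm_id": rows[0]["mdm_id"]}
    for f in fields:
        values = [r[f] for r in rows if r[f] is not None]
        golden[f] = Counter(values).most_common(1)[0][0] if values else None
    return golden


matched = assign_mdm_ids([standardize(r) for r in STAGING])
print(survive(matched))
```

With these toy rules all three rows receive MDM ID 1 and the surviving record carries the standardized name and the SYS A address, which is the same outcome as the Integrated Library on the slide. A production matcher would use fuzzy comparison and source-trust rankings rather than prefix tests and frequency counts.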
The Reality of Mastering Data
Graph Databases (NoSQL) to the Rescue
• Hierarchical relationships are never rigid
• Relational models with tables and columns not flexible enough
• Neo4j is the leading graph database
• Many MDM systems are going graph:
  • Pitney Bowes - Spectrum MDM
  • Reltio - Worry-Free Data for Life Sciences.

Big Data Security
• Determining Who Sees What:
  • Need to be able to secure as many data types as possible
  • Auto-discovery important!
• Current products:
  • Sentry – SQL security semantics to Hive
  • Knox – Central authentication mechanism to Hadoop
  • Cloudera Navigator – Central security auditing
  • Hadoop – Good old *NIX permissions with LDAP
  • Dataguise – Auto-discovery, masking, encryption
  • Datameer – The BI Tool for Hadoop
Recommendation: Assemble based on existing tools
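To make "auto-discovery" and "masking" concrete, here is a minimal sketch that scans records for values shaped like a US Social Security number and masks them. It is not how Dataguise or any listed product works; the regular expression, function names and sample records are assumptions for illustration, and a real scanner would cover many more identifier types and formats.

```python
import re

# Assumed pattern for US Social Security numbers written as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def discover_sensitive_fields(records):
    """Return the names of fields where any value looks like an SSN."""
    found = set()
    for record in records:
        for name, value in record.items():
            if isinstance(value, str) and SSN_PATTERN.search(value):
                found.add(name)
    return found


def mask_ssn(text):
    """Replace each SSN with a masked form that keeps the last four digits."""
    return SSN_PATTERN.sub(r"XXX-XX-\1", text)


records = [
    {"name": "James Stagnitto", "note": "SSN on file: 123-45-6789"},
    {"name": "Pat Example", "note": "no identifiers"},
]
print(discover_sensitive_fields(records))  # fields that need protection
print(mask_ssn(records[0]["note"]))
```

Because the scan works on free text as well as typed columns, the same idea extends to the non-tabular data types the slide asks to secure.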
Metadata
• For now Hive Metastore, HCatalog + Custom might be best
• HCatalog gives great “abstraction” services
• Maps to a relational schema
• Developers don’t need to worry about data formats and
storage
• Can use SuperLuminate to get started
Recommendation: Leverage HCatalog + Custom metadata tables
The Twitter Way
• Twitter was suffering from a data science wild west.
• Developed their own enterprise Data Access Layer (DAL)
They gave developers and data scientists a reason to use it:
• Easy to use storage handlers
• Automatic partitioning
• Schema backwards compatibility
• Monitoring and dependency checks
Data Quality and Monitoring
• To TRUST your information, a robust set of tools for continuous monitoring is needed
• Accuracy and completeness of data must be ensured.
• Any piece of information in the Big Data Warehouse must have monitoring:
  • Basic Stats: source to target counts
  • Error Events: did we trap any errors during processing?
  • Business Checks: is the metric “within expectations”? How does it compare with an abridged alternate calculation?
Large gap in commercial product and open source project offerings
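The three kinds of monitoring listed above can each be reduced to a small predicate. This is a sketch only: the function names, the rejected-row accounting and the 5% tolerance are assumptions chosen for illustration, not values from the talk.

```python
def count_check(source_rows, target_rows, rejected_rows=0):
    """Basic stat: every source row is either loaded or explicitly rejected."""
    return source_rows == target_rows + rejected_rows


def error_event_check(error_events):
    """Error events: pass only if processing trapped no errors."""
    return len(error_events) == 0


def business_check(metric, alternate, tolerance=0.05):
    """Business check: metric is within `tolerance` of an alternate calculation."""
    if alternate == 0:
        return metric == 0
    return abs(metric - alternate) / abs(alternate) <= tolerance


results = {
    "counts": count_check(source_rows=1_000_000, target_rows=999_990, rejected_rows=10),
    "errors": error_event_check(error_events=[]),
    "revenue": business_check(metric=10_450.0, alternate=10_000.0),
}
print(results)
```

The business check compares the reported metric with an independent, cheaper calculation of the same quantity, which is one reading of the "abridged alternate calculation" mentioned above.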
Data Quality and Monitoring Recommendation
• BUILD a robust data quality subsystem:
  • HBase for metadata and error event facts
  • Oozie for orchestration
  • Based on Data Warehouse ETL Toolkit

[Diagram: DQ Engine. A Quality Check Builder reads DQ metadata and produces checks that run on Hive, Pig and MR; a DQ Notifier and Logger records DQ Events and Timeseries Facts.]
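A metadata-driven DQ engine of this shape can be sketched in a few lines: rules live as data, a builder turns them into executable checks, and each run emits one event fact per check. The rule layout, operator names and statistics below are hypothetical; in the recommended design the metadata and event facts would be stored in HBase and the statistics would come from Hive, Pig or MapReduce jobs.

```python
from datetime import datetime, timezone

# Hypothetical DQ metadata rows; the recommendation stores these in HBase.
DQ_METADATA = [
    {"check_id": "claims_not_empty", "metric": "row_count", "op": "min", "value": 1},
    {"check_id": "claims_null_rate", "metric": "null_member_pct", "op": "max", "value": 0.02},
]

OPERATORS = {
    "min": lambda observed, limit: observed >= limit,
    "max": lambda observed, limit: observed <= limit,
}


def build_checks(metadata):
    """Quality Check Builder: turn metadata rows into (check_id, predicate) pairs."""
    checks = []
    for row in metadata:
        compare, limit, metric = OPERATORS[row["op"]], row["value"], row["metric"]
        # Default arguments bind the current row's values to this predicate.
        checks.append(
            (row["check_id"],
             lambda stats, c=compare, l=limit, m=metric: c(stats[m], l))
        )
    return checks


def run_checks(checks, stats, now=None):
    """Run every check and return one event fact per check."""
    now = now or datetime.now(timezone.utc)
    return [
        {"check_id": check_id, "passed": predicate(stats), "ts": now.isoformat()}
        for check_id, predicate in checks
    ]


# Observed statistics are hard-coded here for the example.
stats = {"row_count": 125_000, "null_member_pct": 0.035}
events = run_checks(build_checks(DQ_METADATA), stats)
for event in events:
    if not event["passed"]:
        print(f"DQ ALERT {event['check_id']} at {event['ts']}")
```

Keeping the rules as data means a steward can add or tighten a check without a code change, and the timestamped event facts accumulate into the time series the diagram refers to.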
Closing Thoughts – Enable the Future
• Big Data requires the convergence of data quality, data management, data engineering and business policies.
• Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.
• Get experts to help calm the turbulence… it can be exhausting!
• Blaze new trails!

Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.”
-- Martin Fowler
Thank You

Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta

More Related Content

What's hot

Deploying a Governed Data Lake
Deploying a Governed Data LakeDeploying a Governed Data Lake
Deploying a Governed Data Lake
WaterlineData
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
Cloudera, Inc.
 
Stream based Data Integration
Stream based Data IntegrationStream based Data Integration
Stream based Data Integration
Jeffrey T. Pollock
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
VMware Tanzu
 
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Jeffrey T. Pollock
 
Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?
DATAVERSITY
 
The principles of the business data lake
The principles of the business data lakeThe principles of the business data lake
The principles of the business data lake
Capgemini
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
NoSQLmatters
 
Flash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lonFlash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lon
Jeffrey T. Pollock
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
Caserta
 
Information Virtualization: Query Federation on Data Lakes
Information Virtualization: Query Federation on Data LakesInformation Virtualization: Query Federation on Data Lakes
Information Virtualization: Query Federation on Data Lakes
DataWorks Summit
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance Initiative
DataWorks Summit
 
Traditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonTraditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A Comparison
Capgemini
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
DataWorks Summit/Hadoop Summit
 
Webinar - Data Lake Management: Extending Storage and Lifecycle of Data
Webinar - Data Lake Management: Extending Storage and Lifecycle of DataWebinar - Data Lake Management: Extending Storage and Lifecycle of Data
Webinar - Data Lake Management: Extending Storage and Lifecycle of Data
Zaloni
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
Caserta
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
DATAVERSITY
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
Mark van Rijmenam
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 

What's hot (20)

Deploying a Governed Data Lake
Deploying a Governed Data LakeDeploying a Governed Data Lake
Deploying a Governed Data Lake
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Stream based Data Integration
Stream based Data IntegrationStream based Data Integration
Stream based Data Integration
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877)
 
Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?
 
The principles of the business data lake
The principles of the business data lakeThe principles of the business data lake
The principles of the business data lake
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
 
Flash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lonFlash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lon
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Information Virtualization: Query Federation on Data Lakes
Information Virtualization: Query Federation on Data LakesInformation Virtualization: Query Federation on Data Lakes
Information Virtualization: Query Federation on Data Lakes
 
Data Governance Initiative
Data Governance InitiativeData Governance Initiative
Data Governance Initiative
 
Traditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonTraditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A Comparison
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
 
Webinar - Data Lake Management: Extending Storage and Lifecycle of Data
Webinar - Data Lake Management: Extending Storage and Lifecycle of DataWebinar - Data Lake Management: Extending Storage and Lifecycle of Data
Webinar - Data Lake Management: Extending Storage and Lifecycle of Data
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 

Viewers also liked

Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
Spark Summit
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
Inside Analysis
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
QAware GmbH
 
Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...
Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...
Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...
CA Technologies
 
Cloud Camp Azure概要
Cloud Camp Azure概要Cloud Camp Azure概要
Cloud Camp Azure概要
Daiyu Hatakeyama
 
1st step LogicFlow
1st step LogicFlow1st step LogicFlow
1st step LogicFlow
Tomoyuki Obi
 
IBM CEC Big Data 2011 06-11 final
IBM CEC Big Data 2011 06-11 finalIBM CEC Big Data 2011 06-11 final
IBM CEC Big Data 2011 06-11 final
COMMON Europe
 
D5 crazy speed web development
D5 crazy speed web developmentD5 crazy speed web development
D5 crazy speed web development
NAVER D2
 
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Lucidworks
 
Rapid Infrastructure Provisioning
Rapid Infrastructure ProvisioningRapid Infrastructure Provisioning
Rapid Infrastructure Provisioning
Uchit Vyas ☁
 
Roadmap to data driven advice michael goedhart 1v0
Roadmap to data driven advice michael goedhart 1v0Roadmap to data driven advice michael goedhart 1v0
Roadmap to data driven advice michael goedhart 1v0
BigDataExpo
 
Big Data Expo 2015 - Teradata Big Data : Just use it!
Big Data Expo 2015 - Teradata Big Data : Just use it!Big Data Expo 2015 - Teradata Big Data : Just use it!
Big Data Expo 2015 - Teradata Big Data : Just use it!
BigDataExpo
 
OC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBMOC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBM
Big Data Joe™ Rossi
 
How to Collect and Process Data Under GDPR?
How to Collect and Process Data Under GDPR?How to Collect and Process Data Under GDPR?
How to Collect and Process Data Under GDPR?
Piwik PRO
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
BigDataExpo
 
Gaining visibility into your Openshift application container platform with Dy...
Gaining visibility into your Openshift application container platform with Dy...Gaining visibility into your Openshift application container platform with Dy...
Gaining visibility into your Openshift application container platform with Dy...
Dynatrace
 
E learning: kansen en risico's
E learning: kansen en risico'sE learning: kansen en risico's
E learning: kansen en risico's
Jurgen Gaeremyn
 

Viewers also liked (20)

Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...
Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...
Pre-Con Ed: Discover the New CA App Experience Analytics 16.3 - The Omnichann...
 
ecdevday7
ecdevday7ecdevday7
ecdevday7
 
Cloud Camp Azure概要
Cloud Camp Azure概要Cloud Camp Azure概要
Cloud Camp Azure概要
 
stagerapport2.3
stagerapport2.3stagerapport2.3
stagerapport2.3
 
1st step LogicFlow
1st step LogicFlow1st step LogicFlow
1st step LogicFlow
 
IBM CEC Big Data 2011 06-11 final
IBM CEC Big Data 2011 06-11 finalIBM CEC Big Data 2011 06-11 final
IBM CEC Big Data 2011 06-11 final
 
D5 crazy speed web development
Data Governance, Compliance and Security in Hadoop with Cloudera

  • 1. Big Data Warehousing, February 10, 2014. Today’s Topic: Data Governance, Compliance and Security in Hadoop - with Cloudera. Sponsored By:
  • 2. Agenda. 7:00 Networking (15 min): Grab some food and drink... Make some friends. 7:15 Joe Caserta, President, Caserta Concepts (45 min): Welcome + Intro (10 min): About the Meetup, about Caserta Concepts. Data Governance in Big Data (35 min): Overview of Data Governance and implementation options in Hadoop. 8:00 Patrick Angeles, Chief Architect, Financial Services, Cloudera (45 min): Using Cloudera to ensure data governance in Hadoop: Deep dive into Cloudera Data Governance Tools. 8:45 Q&A, More Networking (15 min): Tell us what you’re up to…
  • 3. Joe / Caserta Concepts Timeline Established best practices for big data ecosystem implementation – Healthcare, Finance, Insurance 2013 Launched Big Data Warehousing Meetup in NYC – 850+ Members 2012 Partnered with Big Data vendors Cloudera, Hortonworks, Datameer, more… 2010 2009 Laser focus on extending Data Warehouses with Big Data solutions Formalized Alliances / Partnerships – System Integrators Launched Big Data practice 2004 Launched Training practice, teaching data concepts world-wide 2001 Founded Caserta Concepts in NYC Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley) Web log analytics solution published in Intelligent Enterprise 1996 Began consulting database programming and data modeling Dedicated to Data Warehousing, Business Intelligence since 1996 1986 25+ years hands-on experience building database solutions
  • 4. Caserta Concepts. Technology services company with focused expertise in: Data Warehousing; Business Intelligence; Big Data Analytics. Data is all we do. Established in 2001: Industry-recognized workforce; Consulting, Education, Implementation. Broad experience across industries: Healthcare / Financial Services / Insurance; Manufacturing / Higher Education / eCommerce.
  • 5. Implementation Expertise & Offerings: Strategic Roadmap/Assessment/Consulting; Big Data Analytics; Storm; Database; BI/Visualization/Analytics; Master Data Management
  • 6. Client Portfolio: Finance, Healthcare & Insurance; Retail/eCommerce & Manufacturing; Education & Services
  • 8. Caserta Concepts Listed as One of the Top 20 Most Promising Data Analytics Consulting Companies. CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones that are at the forefront of tackling the real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the editorial board of CIOReview selected the Final 20.
  • 9. Help Wanted: Does this word cloud excite you? (Word cloud: Cassandra, Big Data Architect, Storm, HBase.) Speak with us about our open positions: leslie@casertaconcepts.com
  • 10. Innovation is the only sustainable competitive advantage a company can have.
  • 11. About the BDW Meetup • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like-minded data nerds • Opportunities to collaborate on exciting projects • Founded by Caserta Concepts, DW, BI & Big Data Analytics Consulting • Next BDW Meetup: March 25, 2014 • New Work City, Broadway & Canal • Topic TBD, Suggestions?
  • 12. Why Big Data? (Architecture diagram. Sources: Enrollments, Claims, Finance, Others… Traditional BI: ETL, Traditional EDW, Ad-Hoc/Canned Reporting. Big Data Analytics: ETL, Big Data Cluster with nodes N1 to N5 on the Hadoop Distributed File System (HDFS), MapReduce, Pig/Hive, Mahout, NoSQL Databases, Data Science, Ad-Hoc Query, Canned Reporting. Horizontally Scalable Environment - Optimized for Analytics.)
  • 13. Quick Vocabulary Lesson. Hadoop Distribution: Cloudera, MapR, Hortonworks, Pivotal-HD. Tools: Whirr: Used to launch/kill computing clusters; Kafka: Publish-subscribe messaging system; Mahout: Machine learning; Hive: Map data to structures and use SQL-like queries; Pig: Data transformation language for big data, from Yahoo; Sqoop: Extracts from external sources and loads into Hadoop; Zookeeper: Used to manage & administer Hadoop; Storm: Real-time ETL. NoSQL: Document: MongoDB, CouchDB; Graph: Neo4j, Titan; Key Value: Riak, Redis; Columnar: Cassandra, HBase. Languages: Python, SciPy, Java
  • 14. The Challenges With Big Data. Volume: • Data volume is higher so the process must be more reliant on programmatic administration • Less people/process dependence. Veracity: • Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis? Variety: • Wider breadth of datasets and sources in scope requires larger data governance support • Data governance cannot start at the warehouse. Velocity: • Data is coming in so fast, how do we monitor it? • Real real-time analytics • What does “complete” mean?
  • 15. Why is Big Data Governance Important? Convergence of: Data quality; Management and policies; All data in an organization? Set of processes: Ensures important data assets are formally managed throughout the enterprise; Ensures data can be trusted; People made accountable for low data quality. It is about putting people and technology in place to fix and prevent issues with data so that the enterprise can become more efficient.
  • 16. The Components of Data Governance • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: Identify and control sensitive data, regulatory compliance • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify • Business Process Integration: Policies around data frequency, source availability, etc. • Master Data Management: Ensure consistent business critical data, i.e. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
  • 17. What’s Old is New Again  Before Data Warehousing Data Governance • Users trying to produce reports from raw source data • No Data Conformance • No Master Data Management • No Data Quality processes • No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers!  Before Big Data Governance  We can put “anything” in Hadoop  We can analyze anything  We’re scientists, we don’t need IT, we make the rules  Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess  Rule #2: Information harvested from ungoverned systems will take us back to the old days: No Trust = Not Actionable
  • 18. Making it Right  The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways  New tools  External data  Data blending  Decentralization  With all the V’s, data scientists, new tools, new data we must rely LESS on HUMANS  We need more systemic administration  We need systems, tools to help with big data governance  This space is EXTREMELY immature!  Steps towards Big Data Governance 1. Establish difference between traditional data and big data governance 2. Establish basic rules for where new data governance can be applied 3. Establish processes for graduating the products of data science to governance 4. Establish a set of tools to make governing Big Data feasible
  • 19. Preventing a Data Swamp with Governance  Org and Process • Add Big Data to overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (Twitter, call center logs, etc.)  Master Data Management • Graph databases are more flexible than relational • Lower latency service required • Distributed data quality and matching algorithms  Data Quality and Monitoring • Data Quality and Monitoring (probably home grown, Drools?) • Quality checks not only SQL: machine learning, Pig and MapReduce • Acting on large dataset quality checks may require distribution  Metadata • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home grown tables  Information Lifecycle • Secure and mask multiple data types (not just tabular) • Deletes are more uncommon (unless there is regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier)
  • 20. The Big Data Governance Pyramid  Hadoop has different governance demands at each tier.  Only top tier of the pyramid is fully governed.  We refer to this as the Trusted tier of the Big Data Warehouse.  Tier 4, Big Data Warehouse: User community arbitrary queries and reporting. Fully Data Governed (trusted).  Tier 3, Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Metadata: Catalog; ILM: who has access, how long do we “manage it”; Data Quality and Monitoring: monitoring of completeness of data.  Tier 2, Data Lake – Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete. Metadata: Catalog; ILM: who has access, how long to “manage it”; Data Quality and Monitoring: monitoring of completeness of data.  Tier 1, Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything. Metadata: Catalog; ILM: who has access, how long do we “manage it”.
  • 21. Big Data Governance Realities  Full data governance can only be applied to “Structured” data  The data must have a known and well documented schema  This can include materialized endpoints such as files or tables OR projections such as a Hive table  Governed structured data must have:  A known schema with Metadata  A known and certified lineage  A monitored, quality-tested, managed process for ingestion and transformation  A governed usage  Data isn’t just for enterprise BI tools anymore  We talk about unstructured data in Hadoop, but more often it is semi-structured or structured data with a definable schema.  Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
  • 22. The Data Scientists Can Help!  Provide requirements for Data Lake  Proper metadata established:  Catalog  Data Definitions  Lineage  Quality monitoring  Know and validate data completeness  Data Science to Big Data Warehouse mapping  Full Data Governance Requirements  Provide full process lineage  Data certification process by data stewards and business owners  Ongoing Data Quality monitoring that includes Quality Checks
  • 23. What does a Data Scientist Do, Anyway?  Writes really cool and sophisticated algorithms that impact the way the business runs.  NOT  Much of the time of a Data Scientist is spent:  Searching for the data they need  Making sense of the data  Figuring out why the data looks the way it does and assessing its validity  Cleaning up all the garbage within the data so it represents true business  Combining events with Reference data to give it context  Correlating event data with other events  Finally, they write algorithms to perform mining, clustering and predictive analytics – the sexy stuff.
  • 24. The Non-Data Part of Big Data  Caution: Some Assembly Required  The V’s require robust tooling:  Unfortunately the toolset is pretty thin: Some of the most hopeful tools are brand new or in incubation!  Components like ILM have fair tooling, others like MDM and Data Quality are sparse  People, Processes and Business commitment are still critical!  Apache Falcon (Incubating) promises many of the features we need, but it is fairly immature (Version 0.3).  Recommendation: Roll your own custom lifecycle management workflow using Oozie + retention metadata
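The "retention metadata" half of that recommendation could look something like the sketch below. It is only an illustration under stated assumptions: the `RetentionPolicy` fields, the dataset name, and the day thresholds are all hypothetical, and nothing here calls Oozie. A scheduled workflow would run logic of this kind and then carry out the archive or purge step itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical retention metadata for one dataset."""
    dataset: str
    archive_after_days: int  # move to compressed/cheaper storage at this age
    purge_after_days: int    # delete at this age (e.g. a regulatory limit)


def lifecycle_action(policy, partition_date, today):
    """Return 'keep', 'archive' or 'purge' for a single dated partition."""
    age_days = (today - partition_date).days
    if age_days >= policy.purge_after_days:
        return "purge"
    if age_days >= policy.archive_after_days:
        return "archive"
    return "keep"


def plan(policy, partition_dates, today):
    """Map each partition date to the action a scheduled job would take."""
    return {d.isoformat(): lifecycle_action(policy, d, today) for d in partition_dates}


if __name__ == "__main__":
    # Illustrative values only; real thresholds come from the governance council.
    policy = RetentionPolicy("call_center_logs", archive_after_days=90, purge_after_days=365)
    today = date(2014, 2, 10)
    partitions = [today - timedelta(days=n) for n in (1, 120, 400)]
    for day, action in plan(policy, partitions, today).items():
        print(day, action)
```

Keeping the policy as data rather than code means stewards can change retention rules without redeploying the workflow.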
  • 25. Master Data Management  Traditional MDM will do depending on your data size and requirements:  Relational is awkward: extreme normalization, poor usability and performance  NoSQL stores like HBase have benefits:  If you need super high performance, low millisecond response times to incorporate into your Big Data ETL  Flexible Schema  Graph database is a near perfect fit. Relationships and graph analysis bring master data to life!  Data quality and matching processes are required  Little to no community or vendor support  More will come with YARN (more Commercial and Open Source IP will be leverageable in the Hadoop framework)  Recommendation: Buy + Enhance or Build.
  • 26. Master Data Management Components  [Diagram labels: User Interface, Customers, Security, Vendors, Services, Data, Products, Employees, Transactions?, Rules, Workflow]  Consistent Policy Enforcement and Security  Integration with existing ecosystem  Data Governance through Workflow Management  Data Quality enforcement through metadata-driven rules  Time-Variant Hierarchies and attributes  High Performance, Flexible, Scalable Database – Think Graph!
  • 27. Mastering Data  Staging Library (Validation): • Source SYS A, ID 123: Name Jim Stagnitto, Home Address 123 Main St, Birth Date 8/20/1959, SSN 123-45-6789 • Source SYS B, ID ABC: Name J. Stagnitto, Home Address 132 Main Street, Birth Date 8/20/1959, SSN 123-45-6789 • Source SYS C, ID XYZ: Name James Stag, Home Address NULL, Birth Date 8/20/1959, SSN NULL  Consolidated Library (Standardization, Matching): • SYS A, ID 123: Std Name James Stagnitto, Std Addr 123 Main Street, MDM ID 1 • SYS B, ID ABC: Std Name James Stagnitto, Std Addr 132 Main Street, MDM ID 1 • SYS C, ID XYZ: Std Name James Stag, Std Addr NULL, MDM ID 1  Integrated Library (Survivorship): • MDM ID 1: Name James Stagnitto, Home Address 123 Main Street, Birth Date 8/20/1959, SSN 123-45-6789
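The standardization, matching and survivorship stages on this slide can be sketched in a few lines of Python using the slide's own three records. The lookup tables, the matching rule (same SSN, or same birth date plus one name being a prefix of the other) and the survivorship rule (most common non-null value, earliest source wins ties) are assumptions chosen so the sketch reproduces the slide's result. The slide does not state which rules were actually used.

```python
from collections import Counter

# Records from the slide; None stands in for NULL.
STAGING = [
    {"source": "SYS A", "id": "123", "name": "Jim Stagnitto", "address": "123 Main St",
     "birth_date": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS B", "id": "ABC", "name": "J. Stagnitto", "address": "132 Main Street",
     "birth_date": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS C", "id": "XYZ", "name": "James Stag", "address": None,
     "birth_date": "8/20/1959", "ssn": None},
]

# Hypothetical lookup tables; a real system would use far richer reference data.
FIRST_NAMES = {"Jim": "James", "J.": "James"}
STREET_WORDS = {"St": "Street"}


def standardize(rec):
    """Return a copy of rec with std_name and std_addr added."""
    first, _, rest = rec["name"].partition(" ")
    std_name = f"{FIRST_NAMES.get(first, first)} {rest}".strip()
    std_addr = None
    if rec["address"] is not None:
        std_addr = " ".join(STREET_WORDS.get(w, w) for w in rec["address"].split())
    return {**rec, "std_name": std_name, "std_addr": std_addr}


def is_match(a, b):
    """Assumed rule: same SSN, or same birth date plus one name prefixing the other."""
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return True
    same_dob = a["birth_date"] == b["birth_date"]
    shorter, longer = sorted((a["std_name"], b["std_name"]), key=len)
    return same_dob and longer.startswith(shorter)


def assign_mdm_ids(records):
    """Greedy clustering: join the first cluster that contains a matching record."""
    clusters = []
    for rec in records:
        for mdm_id, members in enumerate(clusters, start=1):
            if any(is_match(rec, m) for m in members):
                rec["mdm_id"] = mdm_id
                members.append(rec)
                break
        else:  # no existing cluster matched, so start a new one
            clusters.append([rec])
            rec["mdm_id"] = len(clusters)
    return clusters


def survive(members):
    """Per attribute, keep the most common non-null value (earliest source wins ties)."""
    golden = {"mdm_id": members[0]["mdm_id"]}
    for out_key, in_key in [("name", "std_name"), ("address", "std_addr"),
                            ("birth_date", "birth_date"), ("ssn", "ssn")]:
        values = [m[in_key] for m in members if m[in_key] is not None]
        golden[out_key] = Counter(values).most_common(1)[0][0] if values else None
    return golden


if __name__ == "__main__":
    standardized = [standardize(r) for r in STAGING]
    for cluster in assign_mdm_ids(standardized):
        print(survive(cluster))
```

Greedy clustering is enough for three records. At scale, matching is usually preceded by blocking so that not every pair of records has to be compared.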
  • 28. The Reality of Mastering Data
  • 29. Graph Databases (NoSQL) to the Rescue  Hierarchical relationships are never rigid  Relational models with tables and columns are not flexible enough  Neo4j is the leading graph database  Many MDM systems are going graph:  Pitney Bowes - Spectrum MDM  Reltio - Worry-Free Data for Life Sciences.
  • 30. Big Data Security  Determining Who Sees What:  Need to be able to secure as many data types as possible  Auto-discovery important!  Current products:  Sentry – SQL security semantics to Hive  Knox – Central authentication mechanism to Hadoop  Cloudera Navigator – Central security auditing  Hadoop - Good old *NIX permission with LDAP  Dataguise – Auto-discovery, masking, encryption  Datameer – The BI Tool for Hadoop Recommendation: Assemble based on existing tools
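To make "auto-discovery" and "masking" concrete, here is a minimal sketch that scans record values for strings shaped like US Social Security numbers and masks all but the last four digits. It is a toy illustration, not how Dataguise or any other listed product works. The regular expression and function names are assumptions, and real discovery has to cover many more data types and formats.

```python
import re

# Assumed pattern for US Social Security numbers written as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def discover_sensitive_fields(records):
    """Return the sorted names of fields in which any value looks like an SSN."""
    hits = set()
    for rec in records:
        for field, value in rec.items():
            if isinstance(value, str) and SSN_PATTERN.search(value):
                hits.add(field)
    return sorted(hits)


def mask_ssn(text):
    """Replace every SSN-like substring, keeping only the last four digits."""
    return SSN_PATTERN.sub(lambda m: "XXX-XX-" + m.group(1), text)


def mask_records(records, fields):
    """Return copies of the records with the given fields masked."""
    return [
        {k: mask_ssn(v) if k in fields and isinstance(v, str) else v
         for k, v in rec.items()}
        for rec in records
    ]


if __name__ == "__main__":
    rows = [
        {"name": "James Stagnitto", "note": "SSN on file: 123-45-6789", "ssn": "123-45-6789"},
        {"name": "Unknown", "note": None, "ssn": None},
    ]
    sensitive = discover_sensitive_fields(rows)
    print(sensitive)
    print(mask_records(rows, sensitive))
```

Scanning values rather than column names is what lets discovery catch sensitive data hiding in free-text fields such as `note`.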
  • 31. Metadata • For now Hive Metastore, HCatalog + Custom might be best • HCatalog gives great “abstraction” services • Maps to a relational schema • Developers don’t need to worry about data formats and storage • Can use SuperLuminate to get started Recommendation: Leverage HCatalog + Custom metadata tables
  • 32. The Twitter Way  Twitter was suffering from a data science wild west.  Developed their own enterprise Data Access Layer (DAL)  They gave developers and data scientists a reason to use it: • Easy to use storage handlers • Automatic partitioning • Schema backwards compatibility • Monitoring and dependency checks
  • 33. Data Quality and Monitoring  To TRUST your information a robust set of tools for continuous monitoring is needed  Accuracy and completeness of data must be ensured.  Any piece of information in the Big Data Warehouse must have monitoring:  Basic Stats: source to target counts  Error Events: did we trap any errors during processing?  Business Checks: is the metric “within expectations”? How does it compare with an abridged alternate calculation?  Large gap in commercial and open source project offerings
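The three kinds of monitoring named on this slide can each be expressed as a small check. The sketch below is an assumption-laden illustration: the function names, the zero-row tolerance and the 5% allowed gap are all hypothetical. In practice the counts and metrics would come from Hive, Pig or MapReduce jobs rather than literals.

```python
def count_check(source_rows, target_rows, tolerance=0):
    """Basic stats: source and target row counts agree within a tolerance."""
    return abs(source_rows - target_rows) <= tolerance


def error_event_check(error_events):
    """Error events: pass only if no errors were trapped during processing."""
    return len(error_events) == 0


def business_check(metric, alternate, max_relative_gap=0.05):
    """Business check: metric is close to an abridged alternate calculation."""
    if alternate == 0:
        return metric == 0
    return abs(metric - alternate) / abs(alternate) <= max_relative_gap


def run_monitoring(source_rows, target_rows, error_events, metric, alternate):
    """Run all three checks and report each result by name."""
    return {
        "basic_stats": count_check(source_rows, target_rows),
        "error_events": error_event_check(error_events),
        "business_check": business_check(metric, alternate),
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    results = run_monitoring(
        source_rows=1000, target_rows=1000,
        error_events=["row 17: unparseable date"],
        metric=104.0, alternate=100.0,
    )
    for name, passed in results.items():
        print(name, "PASS" if passed else "FAIL")
```

Reporting each check separately, rather than one overall flag, shows a steward whether the problem is missing rows, trapped errors or a suspicious metric.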
  • 34. Data Quality and Monitoring Recommendation • BUILD a robust data quality subsystem: • HBase for metadata and error event facts • Oozie for orchestration • Based on Data Warehouse ETL Toolkit  [Diagram: DQ Engine with DQ Metadata, Quality Check Builder (Hive, Pig, MR), DQ Events and Timeseries Facts, DQ Notifier and Logger]
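The components of that subsystem (DQ metadata, a quality check builder, event facts and a notifier) might fit together as in the following sketch. Everything here is hypothetical: the metadata layout, the two check kinds and the `claims` table are invented for illustration. In-memory lists stand in for the HBase tables and for the Hive/Pig/MapReduce jobs that would evaluate checks at scale.

```python
from dataclasses import dataclass

# Hypothetical DQ metadata: one row per quality check.
DQ_METADATA = [
    {"check_id": 1, "table": "claims", "column": "member_id", "kind": "not_null"},
    {"check_id": 2, "table": "claims", "column": "amount", "kind": "range",
     "min": 0, "max": 100000},
]


@dataclass(frozen=True)
class DQEvent:
    """One event fact: the outcome of one check against one table."""
    check_id: int
    table: str
    rows_checked: int
    rows_failed: int

    @property
    def passed(self):
        return self.rows_failed == 0


def build_check(meta):
    """Quality check builder: turn a metadata row into a 'row is bad' predicate."""
    column = meta["column"]
    if meta["kind"] == "not_null":
        return lambda row: row.get(column) is None
    if meta["kind"] == "range":
        lo, hi = meta["min"], meta["max"]
        # Nulls are left to the not_null check; only present values are ranged.
        return lambda row: row.get(column) is not None and not (lo <= row[column] <= hi)
    raise ValueError(f"unknown check kind: {meta['kind']}")


def run_checks(metadata, tables):
    """DQ engine: evaluate every check and emit one event fact per check."""
    events = []
    for meta in metadata:
        rows = tables.get(meta["table"], [])
        is_bad = build_check(meta)
        failed = sum(1 for row in rows if is_bad(row))
        events.append(DQEvent(meta["check_id"], meta["table"], len(rows), failed))
    return events


def notify(events):
    """DQ notifier: format a message for every failed check."""
    return [
        f"check {e.check_id} on {e.table}: {e.rows_failed}/{e.rows_checked} rows failed"
        for e in events if not e.passed
    ]


if __name__ == "__main__":
    tables = {"claims": [
        {"member_id": "M1", "amount": 250},
        {"member_id": None, "amount": 90},
        {"member_id": "M3", "amount": -5},
    ]}
    for message in notify(run_checks(DQ_METADATA, tables)):
        print(message)
```

Because checks are rows of metadata, adding a new one is a data change rather than a code change, which is the point of the metadata-driven design.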
  • 35. Closing Thoughts – Enable the Future  Big Data requires the convergence of data quality, data management, data engineering and business policies.  Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.  Get experts to help calm the turbulence… it can be exhausting!  Blaze new trails! Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” -- Martin Fowler
  • 36. Thank You Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta

Editor's Notes

  1. We focused our attention on building a single version of the truth. We mainly applied data governance on the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data: Enterprise BI tools. It was easier to GOVERN how the data would be used.
  2. Workflow: OpenSymphony. Rules: Drools. Database: Neo4j. Interface: Cytoscape.