Data Governance, Compliance and Security in Hadoop with Cloudera



In our recent Big Data Warehousing Meetup, we discussed Data Governance, Compliance and Security in Hadoop.

As the Big Data paradigm becomes more commonplace, we must apply enterprise-grade governance capabilities to critical data that is highly regulated and must adhere to stringent compliance requirements. Caserta and Cloudera shared techniques and tools that enable data governance, compliance and security on Big Data.


Speaker notes:
  • We focused our attention on building a single version of the truth. We mainly applied data governance to the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to govern how the data would be used.
  • Workflow: OpenSymphony; Rules: Drools; Database: Neo4j; Interface: Cytoscape

    1. Big Data Warehousing Meetup, February 10, 2014. Today's Topic: Data Governance, Compliance and Security in Hadoop, with Cloudera. Sponsored By:
    2. Agenda. 7:00 Networking (15 min): grab some food and drink, make some friends. 7:15 Joe Caserta, President, Caserta Concepts (45 min): Welcome + Intro (10 min), about the Meetup and about Caserta Concepts; Data Governance in Big Data (35 min), an overview of Data Governance and implementation options in Hadoop. 8:00 Patrick Angeles, Chief Architect, Financial Services, Cloudera (45 min): Using Cloudera to ensure data governance in Hadoop, with a deep dive into Cloudera Data Governance tools. 8:45 Q&A, More Networking (15 min): tell us what you're up to…
    3. Joe / Caserta Concepts Timeline. 1986: 25+ years of hands-on experience building database solutions. 1996: began consulting, database programming and data modeling; dedicated to Data Warehousing and Business Intelligence ever since. 2001: founded Caserta Concepts in NYC; co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley); web log analytics solution published in Intelligent Enterprise. 2004: launched the Training practice, teaching data concepts world-wide. 2009 and 2010: launched the Big Data practice; formalized alliances and partnerships with system integrators; laser focus on extending Data Warehouses with Big Data solutions; partnered with Big Data vendors Cloudera, Hortonworks, Datameer and more. 2012: launched the Big Data Warehousing Meetup in NYC, now 850+ members. 2013: established best practices for big data ecosystem implementation in Healthcare, Finance and Insurance.
    4. Caserta Concepts  Technology services company with focused expertise in:  Data Warehousing  Business Intelligence  Big Data Analytics. Data is all we do.  Established in 2001:  Industry recognized work force  Consulting, Education, Implementation  Broad experience across industries:  Healthcare / Financial Services / Insurance  Manufacturing / Higher Education / eCommerce
    5. Implementation Expertise & Offerings: Strategic Roadmap / Assessment / Consulting, Big Data Analytics, BI / Visualization / Analytics, Master Data Management
    6. Client Portfolio: Finance, Healthcare & Insurance; Retail / eCommerce & Manufacturing; Education & Services
    7. Caserta Partners: Hadoop Distributions, Platforms / ETL, Analytics & BI
    8. Caserta Concepts listed as one of the Top 20 Most Promising Data Analytics Consulting Companies. CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the CIOReview editorial board selected the final 20.
    9. Help Wanted. Does this word cloud excite you? Cassandra, Big Data Architect, Storm, HBase. Speak with us about our open positions:
    10. Innovation is the only sustainable competitive advantage a company can have.
    11. About the BDW Meetup • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like-minded data nerds • Opportunities to collaborate on exciting projects • Founded by Caserta Concepts, a DW, BI & Big Data Analytics consulting firm • Next BDW Meetup: March 25, 2014 • New Work City, Broadway & Canal • Topic TBD, suggestions?
    12. Why Big Data? (Diagram: traditional BI, where ETL loads Enrollments, Claims, Finance and other sources into a traditional EDW for ad-hoc and canned reporting, shown alongside Big Data analytics: a Hadoop cluster with HDFS, MapReduce, Pig/Hive and Mahout plus NoSQL databases, a horizontally scalable environment optimized for analytics, ad-hoc query and data science.)
    13. Quick Vocabulary Lesson. Hadoop distributions: Cloudera, MapR, Hortonworks, Pivotal HD. Tools: Whirr (launch and kill computing clusters), Kafka (publish-subscribe messaging system), Mahout (machine learning), Hive (map data to structures and use SQL-like queries), Pig (data transformation language for big data, from Yahoo), Sqoop (extracts from external sources and loads into Hadoop), ZooKeeper (manage and administer Hadoop), Storm (real-time ETL). NoSQL: Document (MongoDB, CouchDB), Graph (Neo4j, Titan), Key-Value (Riak, Redis), Columnar (Cassandra, HBase). Languages: Python, SciPy, Java.
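Since several of the tools above boil down to MapReduce over HDFS, a tiny Hadoop Streaming job in Python may make the model concrete. This is a hedged sketch: the tab-delimited input layout, the source-system code in the first field, and the map/reduce invocation are assumptions for illustration, not part of any product listed.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: count records per source system so raw
# files landing in HDFS can be reconciled against what the source reported.
import sys

def mapper():
    # Assumes tab-delimited rows whose first field is a source-system code.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print("%s\t1" % fields[0])

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so a simple
    # running total per key is enough.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    # Invoked as "count_by_source.py map" for the map phase, anything else
    # runs the reduce phase.
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```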
    14. The Challenges With Big Data. Volume: data volume is higher, so the process must rely more on programmatic administration and less on people and process. Veracity: dealing with sparse, incomplete, volatile, and highly manufactured data; how do you certify sentiment analysis? Variety: a wider breadth of datasets and sources in scope requires larger data governance support; data governance cannot start at the warehouse. Velocity: data is coming in so fast, how do we monitor it? Real real-time analytics; what does "complete" mean?
    15. Why is Big Data Governance Important?  Convergence of data quality, management and policies  All data in an organization?  A set of processes that ensures important data assets are formally managed throughout the enterprise  Ensures data can be trusted  People are made accountable for low data quality. It is about putting people and technology in place to fix and prevent issues with data so that the enterprise can become more efficient.
    16. The Components of Data Governance. Organization: the 'people' part; establishing an Enterprise Data Council, Data Stewards, etc. Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata. Privacy/Security: identify and control sensitive data, regulatory compliance. Data Quality and Monitoring: data must be complete and correct; measure, improve, certify. Business Process Integration: policies around data frequency, source availability, etc. Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc. Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving.
    17. What's Old is New Again  Before Data Warehousing Data Governance:  Users trying to produce reports from raw source data  No data conformance  No Master Data Management  No data quality processes  No trust: two analysts were almost guaranteed to come up with two different sets of numbers!  Before Big Data Governance:  We can put "anything" in Hadoop  We can analyze anything  We're scientists, we don't need IT, we make the rules  Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess  Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
    18. Making it Right  The promise is an "agile" data culture where communities of users are encouraged to explore new datasets in new ways  New tools  External data  Data blending  Decentralization  With all the V's, data scientists, new tools, new data we must rely LESS on HUMANS  We need more systemic administration  We need systems, tools to help with big data governance  This space is EXTREMELY immature!  Steps towards Big Data Governance: 1. Establish difference between traditional data and big data governance 2. Establish basic rules for where new data governance can be applied 3. Establish processes for graduating the products of data science to governance 4. Establish a set of tools to make governing Big Data feasible
    19. Preventing a Data Swamp with Governance. Org and Process: add Big Data to the overall framework and assign responsibility; add data scientists to the stewardship program; assign stewards to new data sets (twitter, call center logs, etc.). Master Data Management: graph databases are more flexible than relational; lower-latency service required; distributed data quality and matching algorithms. Data Quality and Monitoring: probably home grown (Drools?); quality checks not only SQL: machine learning, Pig and MapReduce; acting on large-dataset quality checks may require distribution. Metadata: larger scale; new datatypes; integrate with Hive Metastore, HCatalog, home-grown tables; secure and mask multiple data types (not just tabular). Information Lifecycle: deletes are more uncommon (unless there is a regulatory requirement); take advantage of compression and archiving (like AWS Glacier).
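For the "probably home grown" data quality idea above, a rules-engine-style check can be sketched in a few lines. This is not Drools; it is a minimal Python stand-in, and the field names, rules, and severities are invented for illustration.

```python
# A minimal sketch of a home-grown, rule-based quality check in the spirit of
# the Drools-style rules mentioned above. Fields and thresholds are examples.
from collections import namedtuple

Rule = namedtuple("Rule", ["name", "predicate", "severity"])

RULES = [
    Rule("member_id_present", lambda r: bool(r.get("member_id")), "error"),
    Rule("claim_amount_positive", lambda r: r.get("claim_amount", 0) > 0, "error"),
    Rule("birth_date_populated", lambda r: r.get("birth_date") is not None, "warning"),
]

def evaluate(record):
    """Return the list of (rule name, severity) violations for one record."""
    return [(rule.name, rule.severity) for rule in RULES if not rule.predicate(record)]

if __name__ == "__main__":
    sample = {"member_id": "", "claim_amount": 125.50, "birth_date": None}
    for name, severity in evaluate(sample):
        print(severity, name)   # -> error member_id_present, warning birth_date_populated
```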
    20. The Big Data Governance Pyramid. Hadoop has different governance demands at each tier; only the top tier of the pyramid is fully governed, and we refer to it as the Trusted tier of the Big Data Warehouse. Tier 4, Big Data Warehouse (fully data governed, trusted): user community, arbitrary queries and reporting. Tier 3, Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, and development of to-be BDW facts; governed with a metadata catalog, ILM (who has access, how long do we "manage it") and data quality monitoring of completeness. Tier 2, Data Lake / Integrated Sandbox: data is ready to be turned into information: organized, well defined, complete; governed with a metadata catalog, ILM and data quality monitoring of completeness. Tier 1, Landing Area (source data in "full fidelity"): raw machine data collection, collect everything; governed with a metadata catalog and ILM.
    21. Big Data Governance Realities  Full data governance can only be applied to "structured" data  The data must have a known and well-documented schema  This can include materialized endpoints such as files or tables, or projections such as a Hive table  Governed structured data must have:  A known schema with metadata  A known and certified lineage  A monitored, quality-tested, managed process for ingestion and transformation  A governed usage  Data isn't just for enterprise BI tools anymore  We talk about unstructured data in Hadoop, but more often it's semi-structured or structured with a definable schema  Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
    22. The Data Scientists Can Help!  Provide requirements for Data Lake  Proper metadata established:  Catalog  Data Definitions  Lineage  Quality monitoring  Know and validate data completeness  Data Science to Big Data Warehouse mapping  Full Data Governance Requirements  Provide full process lineage  Data certification process by data stewards and business owners  Ongoing Data Quality monitoring that includes Quality Checks
    23. What does a Data Scientist Do, Anyway?  Writes really cool and sophisticated algorithms that impact the way the business runs… NOT!  Much of a Data Scientist's time is spent:  Searching for the data they need  Making sense of the data  Figuring out why the data looks the way it does and assessing its validity  Cleaning up all the garbage within the data so it represents the true business  Combining events with reference data to give it context  Correlating event data with other events  Finally, they write algorithms to perform mining, clustering and predictive analytics, the sexy stuff.
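As a tiny illustration of the "combining events with reference data" step, the sketch below does a left join of events against a reference dimension with pandas; the column names and values are hypothetical.

```python
# A hedged sketch of enriching raw events with reference data so each event
# carries business context; rows that fail to join are exactly the garbage
# the data scientist ends up chasing before any modeling starts.
import pandas as pd

events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "provider_id": ["P10", "P11", "P99"],   # P99 has no reference entry
    "amount": [200.0, 75.0, 310.0],
})
providers = pd.DataFrame({
    "provider_id": ["P10", "P11"],
    "provider_name": ["North Clinic", "East Clinic"],
    "region": ["NY", "NJ"],
})

enriched = events.merge(providers, on="provider_id", how="left")
unmatched = enriched[enriched["provider_name"].isna()]

print(enriched)
print("unmatched events:", len(unmatched))
```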
    24. The Non-Data Part of Big Data. Caution: some assembly required. The V's require robust tooling, but unfortunately the toolset is pretty thin: some of the most promising tools are brand new or in incubation. Components like ILM have fair tooling; others like MDM and Data Quality are sparse. People, processes and business commitment are still critical! Apache Falcon (incubating) promises many of the features we need, but it is fairly immature (version 0.3). Recommendation: roll your own custom lifecycle management workflow using Oozie + retention metadata.
    25. Master Data Management  Traditional MDM will do, depending on your data size and requirements:  Relational is awkward: extreme normalization, poor usability and performance  NoSQL stores like HBase have benefits if you need super-high-performance, low-millisecond response times to incorporate into your Big Data ETL, and a flexible schema  A graph database is a near-perfect fit: relationships and graph analysis bring master data to life!  Data quality and matching processes are required  Little to no community or vendor support  More will come with YARN (more commercial and open source IP will be leverageable in the Hadoop framework). Recommendation: buy + enhance, or build.
    26. Master Data Management Components  Domains: Customers, Vendors, Services, Products, Employees, Transactions?  Supporting components: User Interface, Security, Rules, Workflow, Data  Consistent policy enforcement and security  Integration with the existing ecosystem  Data governance through workflow management  Data quality enforcement through metadata-driven rules  Time-variant hierarchies and attributes  High-performance, flexible, scalable database: think graph!
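To show why "think graph" fits master data, here is a minimal sketch using an in-memory graph (networkx) rather than a real graph store such as Neo4j; the entity and relationship names are illustrative assumptions.

```python
# A small sketch of why a graph fits master data: entities and relationships
# are first-class, so hierarchy changes are just new edges rather than schema
# changes. A production system would back this with Neo4j or similar.
import networkx as nx

g = nx.DiGraph()
g.add_node("customer:1", type="customer", name="James Stagnitto")
g.add_node("account:A-77", type="account")
g.add_node("household:H-3", type="household")

g.add_edge("customer:1", "account:A-77", rel="OWNS")
g.add_edge("customer:1", "household:H-3", rel="MEMBER_OF")

# Graph traversal answers "what does this master record touch?" directly.
for _, target, data in g.out_edges("customer:1", data=True):
    print(data["rel"], "->", target)
```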
    27. Mastering Data. The example walks three source records through the pipeline: Staging Library, Validation, Standardization, Matching, Survivorship, Integrated Library. Staging: SYS A (ID 123, Jim Stagnitto, 123 Main St, 8/20/1959, SSN 123-45-6789), SYS B (ID ABC, J. Stagnitto, 132 Main Street, 8/20/1959, SSN 123-45-6789), SYS C (ID XYZ, James Stag, no address, 8/20/1959, no SSN). Standardization yields Std Name "James Stagnitto" / "James Stagnitto" / "James Stag" and Std Addr "123 Main Street" / "132 Main Street" / NULL. Matching assigns all three records MDM ID 1, and survivorship produces the integrated record: MDM ID 1, James Stagnitto, 123 Main Street, 8/20/1959, 123-45-6789.
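A toy Python walk-through of the same flow, using the three records from the slide; the standardization, matching, and survivorship rules here are deliberately naive stand-ins for real MDM logic.

```python
# Toy mastering flow: standardize the staged records, cluster them with a
# crude match key, and apply first-non-null survivorship to build one golden
# record. Real implementations use fuzzy name/address matching and
# configurable survivorship rules.
RECORDS = [
    {"source": "SYS A", "name": "Jim Stagnitto", "address": "123 Main St",
     "birth_date": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS B", "name": "J. Stagnitto", "address": "132 Main Street",
     "birth_date": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS C", "name": "James Stag", "address": None,
     "birth_date": "8/20/1959", "ssn": None},
]

def standardize(record):
    # Only expands the street abbreviation seen in the example; real address
    # standardization would use a postal library or service.
    fixed = dict(record)
    if fixed["address"] and fixed["address"].endswith(" St"):
        fixed["address"] = fixed["address"][:-3] + " Street"
    return fixed

def match_key(record):
    # Deliberately crude match key; production matching scores name, address,
    # birth date and SSN together rather than keying on one field.
    return record["birth_date"]

def survive(cluster):
    # Survivorship: first non-null value per attribute across the cluster.
    return {f: next((r[f] for r in cluster if r[f]), None)
            for f in ("name", "address", "birth_date", "ssn")}

clusters = {}
for rec in (standardize(r) for r in RECORDS):
    clusters.setdefault(match_key(rec), []).append(rec)

for mdm_id, cluster in enumerate(clusters.values(), start=1):
    print(mdm_id, survive(cluster))   # -> one golden record with MDM ID 1
```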
    28. The Reality of Mastering Data
    29. Graph Databases (NoSQL) to the Rescue  Hierarchical relationships are never rigid  Relational models with tables and columns are not flexible enough  Neo4j is the leading graph database  Many MDM systems are going graph:  Pitney Bowes: Spectrum MDM  Reltio: worry-free data for life sciences.
    30. Big Data Security  Determining who sees what:  Need to be able to secure as many data types as possible  Auto-discovery is important!  Current products:  Sentry: SQL security semantics for Hive  Knox: central authentication mechanism for Hadoop  Cloudera Navigator: central security auditing  Hadoop: good old *NIX permissions with LDAP  Dataguise: auto-discovery, masking, encryption  Datameer: the BI tool for Hadoop. Recommendation: assemble based on existing tools.
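As a small illustration of "determining who sees what" at the field level, the sketch below tokenizes one identifier and masks an SSN. It is not how Sentry, Navigator, or Dataguise work; the salt handling and field choices are assumptions.

```python
# A hedged sketch of field-level protection: hash the identifiers analysts
# never need in the clear, partially mask the rest, pass the rest through.
import hashlib

SALT = b"replace-with-a-managed-secret"   # in practice, pulled from a key service

def tokenize(value):
    """One-way token so the field can still be joined on but not read."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

def mask_ssn(ssn):
    """Keep the last four digits for support workflows, mask the rest."""
    return "***-**-" + ssn[-4:]

record = {"member_id": "M-1001", "ssn": "123-45-6789", "claim_amount": 125.5}
protected = {
    "member_id": tokenize(record["member_id"]),
    "ssn": mask_ssn(record["ssn"]),
    "claim_amount": record["claim_amount"],   # non-sensitive, passed through
}
print(protected)
```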
    31. Metadata • For now, Hive Metastore + HCatalog + custom might be best • HCatalog gives great "abstraction" services • Maps to a relational schema • Developers don't need to worry about data formats and storage • Can use SuperLuminate to get started. Recommendation: Leverage HCatalog + custom metadata tables.
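A sketch of what a "custom metadata table" row might carry alongside the Hive Metastore/HCatalog entry; every field name here is an assumption about what such a catalog could track, not an HCatalog API.

```python
# One catalog row per dataset, carrying the lineage and stewardship
# attributes the Hive Metastore alone does not hold.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetEntry:
    name: str                      # matches the Hive/HCatalog table name
    owner: str                     # accountable data steward
    source_systems: List[str]
    lineage: List[str]             # upstream tables or jobs, most recent last
    retention_days: int
    certified: bool = False        # flipped by the stewardship sign-off process
    tags: List[str] = field(default_factory=list)

claims = DatasetEntry(
    name="warehouse.claims_daily",
    owner="claims-stewards@example.com",
    source_systems=["SYS A", "SYS B"],
    lineage=["raw.claims_landing", "oozie:claims_cleanse_wf"],
    retention_days=2555,           # roughly 7 years, assuming a regulatory hold
    tags=["pii", "finance"],
)
print(claims)
```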
    32. The Twitter Way  Twitter was suffering from a data science wild west.  They developed their own enterprise Data Access Layer (DAL) and gave developers and data scientists a reason to use it: • Easy-to-use storage handlers • Automatic partitioning • Schema backwards compatibility • Monitoring and dependency checks
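This is not Twitter's actual DAL, just a minimal sketch of the idea: one access layer resolves a logical dataset name to its physical partitions and records the dependency, so partitioning and lineage come for free. The names and paths are invented.

```python
# A hypothetical data-access-layer shim: callers ask for a dataset and a day;
# the layer resolves the partition path and logs the read as a dependency.
import datetime

class DataAccessLayer:
    def __init__(self, catalog):
        self.catalog = catalog          # logical name -> partition path template
        self.reads = []                 # dependency log for monitoring

    def resolve(self, dataset, day):
        template = self.catalog[dataset]
        path = template.format(date=day.strftime("%Y/%m/%d"))
        self.reads.append((dataset, path))
        return path

dal = DataAccessLayer({"clicks": "/data/clicks/{date}/"})
print(dal.resolve("clicks", datetime.date(2014, 2, 10)))
print(dal.reads)
```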
    33. Data Quality and Monitoring  To TRUST your information, a robust set of tools for continuous monitoring is needed  Accuracy and completeness of data must be ensured  Any piece of information in the Big Data Warehouse must have monitoring:  Basic stats: source-to-target counts  Error events: did we trap any errors during processing?  Business checks: is the metric "within expectations"? How does it compare with an abridged alternate calculation? There is a large gap in commercial and open source offerings here.
    34. Data Quality and Monitoring Recommendation • BUILD a robust data quality subsystem: • HBase for metadata and error-event facts • Oozie for orchestration • Based on The Data Warehouse ETL Toolkit. (Diagram: a DQ engine driven by DQ metadata, with a Quality Check Builder running checks through Hive, Pig and MapReduce, writing DQ events and time-series facts, and feeding a DQ notifier and logger.)
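A minimal sketch of the three kinds of checks listed on the previous slide (source-to-target counts, error events, business expectations). Thresholds, counts, and the logging target are illustrative; a real subsystem would persist error-event facts to HBase and orchestrate via Oozie as recommended above.

```python
# Source-to-target reconciliation plus a business sanity check, with failures
# collected as error events and logged.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq")

def counts_match(source_count, target_count, tolerance=0.01):
    """Source-to-target reconciliation with a small tolerance for late rows."""
    drift = abs(source_count - target_count) / max(source_count, 1)
    return drift <= tolerance

def metric_within_expectations(metric, abridged_estimate, band=0.10):
    """Compare the metric with an abridged alternate calculation, +/- 10%."""
    return abs(metric - abridged_estimate) <= band * abridged_estimate

error_events = []
if not counts_match(source_count=1_000_000, target_count=950_000):
    error_events.append({"check": "claims_row_counts", "severity": "error"})
if not metric_within_expectations(metric=4.2e6, abridged_estimate=4.0e6):
    error_events.append({"check": "daily_claims_total", "severity": "warn"})

for event in error_events:
    # A real subsystem would persist these as error-event facts in HBase and
    # let the orchestrator decide whether downstream jobs may proceed.
    log.warning("DQ failure: %s", event)
log.info("checks complete, %d failure(s)", len(error_events))
```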
    35. Closing Thoughts: Enable the Future  Big Data requires the convergence of data quality, data management, data engineering and business policies.  Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.  Get experts to help calm the turbulence… it can be exhausting!  Blaze new trails! Polyglot Persistence: "where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it." -- Martin Fowler
    36. Thank You. Joe Caserta, President, Caserta Concepts, (914) 261-3648, @joe_Caserta