Big Data's Impact on the Enterprise
Joe Caserta
President
Caserta Concepts
@joe_Caserta
Caserta Timeline
• 1986: Began consulting in database programming and data modeling; 25+ years of hands-on experience building database solutions
• 1996: Working in Data Analysis, Data Warehousing and Business Intelligence since 1996
• 2001: Founded Caserta Concepts
• 2004: Co-authored, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley); web log analytics solution published in Intelligent Enterprise
• 2009–2014: Launched Big Data practice; launched Data Science, Data Interaction and Cloud practices; launched the Big Data Warehousing (BDW) Meetup in NYC (2,000+ members); dedicated to Data Governance techniques on Big Data (innovation); laser focus on extending Data Analytics with Big Data solutions; established best practices for big data ecosystem implementations; named a Top 20 Big Data Consulting firm by CIO Review and one of the Top 20 Most Powerful Big Data consulting firms
• 2015: Awarded for getting data out of SAP for data analytics; awarded Top Healthcare Analytics Solution Provider
About Caserta Concepts
• Technology innovation consulting company with expertise in:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
• Solve highly complex business data challenges
• Award-winning solutions
• Business Transformation
• Maximize Data Value
• Innovation partner:
  • Transformative Data Strategies
  • Modern Data Engineering
  • Advanced Analytics
  • Strategic Consulting
  • Technical Architecture
  • Design and Build Solutions
  • Data Science & Analytics
  • Data on the Cloud
  • Data Interaction & Visualization
Client Portfolio
Retail/eCommerce & Manufacturing
Digital Media/AdTech
Education & Services
Finance, Healthcare & Insurance
Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end streaming (Spark)
• Recommendation Engine Optimization
• Relationship Intelligence (Graph DB/Search)
• Others (confidential)
• CIL is hosted on
Community
Recent Awards and Recognitions
Partners
Innovation is the only sustainable competitive advantage a company can have
Innovations may fail, but companies that don’t innovate will fail
What’s New in Modern Data Engineering?
What you need to know (according to Joe)
• Hadoop Distributions: Apache, Cloudera, Hortonworks, MapR, IBM
• Tools:
  • Hive: Map data to structures and use SQL-like queries (see the PySpark sketch below)
  • Pig: Data transformation language for big data
  • Sqoop: Extracts from external sources and loads into Hadoop
  • Spark: General-purpose cluster computing framework
  • Storm: Real-time ETL
• NoSQL:
  • Document: MongoDB, CouchDB
  • Graph: Neo4j, Titan
  • Key-Value: Riak, Redis
  • Columnar: Cassandra, HBase
  • Search: Lucene, Solr, Elasticsearch
• Languages: Python, Java, R, Scala
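To make the Hive/Spark style of access concrete, here is a minimal PySpark sketch of mapping raw files to a structure and querying them with SQL. The file path and column names are hypothetical, not from the deck:

```python
from pyspark.sql import SparkSession

# Start a Spark session (Hive-style SQL comes with Spark SQL).
spark = SparkSession.builder.appName("claims-demo").getOrCreate()

# Map raw delimited files in HDFS to a structure (schema-on-read),
# much like a Hive external table. Path and columns are hypothetical.
claims = (spark.read
          .option("header", "true")
          .csv("hdfs:///data/lake/claims/*.csv"))
claims.createOrReplaceTempView("claims")

# Query the mapped structure with SQL.
totals = spark.sql("""
    SELECT provider_id, COUNT(*) AS claim_cnt
    FROM claims
    GROUP BY provider_id
    ORDER BY claim_cnt DESC
""")
totals.show(10)
```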
The Evolution of Modern Data Engineering
[Architecture diagram] Traditional BI: source systems (Enrollments, Claims, Finance, others…) feed ETL into a traditional EDW for ad-hoc and canned reporting. Alongside it, the same sources land in a Big Data Lake built on the Hadoop Distributed File System (HDFS), a horizontally scalable environment optimized for analytics (nodes N1–N5), where Spark, MapReduce and Pig/Hive process data that feeds NoSQL databases, big data analytics, canned and ad-hoc reporting, and data science.
The Progression of Analytics
[Chart: business value rises with data analytics sophistication. Source: Gartner]
• Descriptive Analytics: What happened? (Reports)
• Diagnostic Analytics: Why did it happen? (Correlations)
• Predictive Analytics: What will happen? (Predictions)
• Prescriptive Analytics: How can we make it happen? (Recommendations)
Governing Big Data
• Before Data Governance:
  • Users trying to produce reports from raw source data
  • No Data Conformance
  • No Master Data Management
  • No Data Quality processes
  • No Trust: two analysts were almost guaranteed to come up with two different sets of numbers!
• Before Big Data Governance:
  • We can put “anything” in Hadoop
  • We can analyze anything
  • We’re scientists, we don’t need IT, we make the rules
• Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or governance will create a mess
• Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
Chief Data Officer – Data Governance
• Organization: the ‘people’ part. Establishing an Enterprise Data Council, Data Stewards, etc.
• Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: data must be complete and correct. Measure, improve, certify
• Business Process Integration: policies around data frequency, source availability, etc.
• Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving
Extending the framework for Big Data:
• Add Big Data to the overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower-latency services required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home-grown; Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce (see the sketch after this list)
• Acting on large dataset quality checks may require distribution
• Larger scale, new datatypes
• Integrate with the Hive Metastore, HCatalog, home-grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations for Big Data
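As one hedged illustration of a distributed, non-SQL quality check, a minimal PySpark sketch might look like the following. The table path, column name and certification threshold are assumptions for the example, not details from the deck:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-monitoring").getOrCreate()

# Hypothetical member data landed in the lake.
members = spark.read.parquet("hdfs:///lake/members")

# Completeness check: share of rows with a populated member_id,
# computed across the cluster rather than inside a single SQL engine.
total = members.count()
complete = members.filter(F.col("member_id").isNotNull()).count()
completeness = complete / total if total else 0.0

# Certify or quarantine the data set against a governance threshold.
THRESHOLD = 0.99  # assumed policy value
status = "certified" if completeness >= THRESHOLD else "quarantined"
print(f"member_id completeness = {completeness:.4f} -> {status}")
```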
The Big Data Pyramid
[Pyramid diagram, bottom tier to top:]
• Landing Area: source data in “full fidelity”; raw machine data collection, collect everything. Governance: Metadata → Catalog; ILM → who has access, how long do we “manage it”
• Data Lake: integrated sandbox. Governance: Metadata → Catalog; ILM; Data Quality and Monitoring → monitoring of completeness of data
• Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, development of to-be BDW facts
• Big Data Warehouse: data is ready to be turned into information: organized, well defined, complete; the user community runs arbitrary queries and reporting. Governance: Metadata → Catalog; ILM; Data Quality and Monitoring; fully data governed (trusted)
• Data has different governance demands at each tier
• Only the top tier of the pyramid is fully governed
• We refer to this as the Trusted tier of the Big Data Warehouse
Defining Data Science
[Venn diagram] Data science is the overlap of three disciplines:
• Modern Data Engineering / Data Preparation
• Domain Knowledge / Business Expertise
• Advanced Mathematics / Statistics
Modern Data Quality Priorities
• Be Corrective
• Be Fast
• Be Transparent
• Be Thorough
Data Quality Priorities
[Chart: speed to value vs. data quality, from raw to refined data. Speed to value is fast on raw data and slows as data is refined, while quality rises with refinement]
Implementing Data Governance on Big Data
People, processes and business commitment are still critical!
Caution: Some Assembly Required
The V’s (Volume, Variety, Veracity, Velocity) require robust tooling:
• Unfortunately the toolsets available are pretty thin: some of the most promising tools are brand new or still in incubation!
• Enterprise big data governance implementations combine products with some custom-built components
Use Cases
• Real-Time Trade Data Analytics
• Comply with Dodd-Frank
• Electronic Medical Record Analytics
• Save lives?
High Volume Trade Data Project
• The equity trading arm of a large US bank needed to scale its infrastructure to parse and process trade data in real time and calculate aggregations/statistics: ~1.4 million messages/second, ~12 billion messages/day, ~240 billion/month
• The solution needed to map the raw data to a data model in memory or at low latency (for real time), while persisting mapped data to disk (for end-of-day reporting)
• The proposed solution also needed to handle ad-hoc data requests for data analytics
The Data
• Primarily FIX messages: Financial Information eXchange
• Established in the early ’90s as a standard for trade data communication; widely used throughout the industry
• Basically a delimited file of variable attribute-value pairs
• Looks something like this:
8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 |
• A single trade can comprise hundreds of such messages, although typical trades have about a dozen
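As a hedged illustration (not the project's actual feed handler), a minimal Python sketch can split such a message into tag/value pairs. Real FIX uses the SOH control character (\x01) as its field delimiter; the pipes below mirror the human-readable sample above, and the tag-name map is abridged rather than a full FIX dictionary:

```python
# Abridged sample based on the message shown above.
SAMPLE = ("8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | "
          "52=20071123-05:30:00.000 | 55=MSFT | 54=1 | 38=15 | "
          "44=15 | 151=15 | 10=128")

# A few common FIX tag numbers, for readability only.
TAG_NAMES = {"8": "BeginString", "35": "MsgType", "49": "SenderCompID",
             "56": "TargetCompID", "55": "Symbol", "54": "Side",
             "38": "OrderQty", "44": "Price", "10": "CheckSum"}

def parse_fix(message: str, sep: str = "|") -> dict:
    """Split a delimited FIX message into a {tag: value} dict."""
    fields = {}
    for pair in message.split(sep):
        tag, eq, value = pair.strip().partition("=")
        if eq:  # skip segments without an '='
            fields[tag] = value
    return fields

for tag, value in parse_fix(SAMPLE).items():
    print(f"{TAG_NAMES.get(tag, 'Tag' + tag)} = {value}")
```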
High Volume Real-time Analytics: Solution Architecture
[Diagram: trade data is ingested through Kafka into a Storm cluster running a data quality rules engine; outputs drive d3.js real-time analytics and event monitors, and land atomic data and aggregates in a Hadoop cluster for low-latency analytics]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs atomic data and derived data needed for analytics
• Redis is used as a reference data lookup cache
• Real-time analytics are produced from the aggregated data
• Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
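As a rough sketch of the ingest-enrich-aggregate step (not the bank's actual Storm topology), a single-process Python consumer using the kafka-python and redis client libraries might look like this; the topic name, connection settings and message fields are all assumptions:

```python
import json
from collections import defaultdict

from kafka import KafkaConsumer  # pip install kafka-python
import redis                     # pip install redis

# Hypothetical topic and connection settings.
consumer = KafkaConsumer("trades",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b))
ref_cache = redis.Redis(host="localhost", port=6379)

# Running per-symbol aggregates (Storm would shard this across bolts).
volume = defaultdict(int)
notional = defaultdict(float)

for record in consumer:
    trade = record.value  # e.g. {"symbol": "MSFT", "qty": 15, "price": 15.0}

    # Reference-data lookup from the Redis cache, e.g. instrument metadata.
    sector = ref_cache.get(f"sector:{trade['symbol']}") or b"UNKNOWN"

    # Update derived aggregates that feed the real-time dashboards.
    volume[trade["symbol"]] += trade["qty"]
    notional[trade["symbol"]] += trade["qty"] * trade["price"]
    print(trade["symbol"], sector.decode(),
          volume[trade["symbol"]], notional[trade["symbol"]])
```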
Electronic Medical Records (EMR) Analytics
[Pipeline diagram: ~100k files in n variant formats arrive on an edge node; Forqlift packs them into SequenceFiles, which an HDFS put lands in the Hadoop data lake; a Pig EMR processor with a UDF library and a Python wrapper produces Provider, Member and 15 more entity tables in Parquet; Sqoop loads them into a Netezza DW as further dimensions and facts]
• Receive Electronic Medical Records from various providers in various formats
• Address the Hadoop ‘small file’ problem (see the sketch after this list)
• No barrier for onboarding and analysis of new data
• Blend new data with the Data Lake and Big Data Warehouse
• Machine Learning
• Text Analytics
• Natural Language Processing
• Reporting
• Ad-hoc queries
• File ingestion
• Information Lifecycle Mgmt
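The project used Forqlift to pack the tiny files into Hadoop SequenceFiles; as a stand-in illustration of the same idea, this minimal Python sketch bundles thousands of small files into a few large archives before the HDFS put, so the NameNode tracks hundreds of objects instead of ~100k. The directory name and bundle size are assumptions:

```python
import tarfile
from pathlib import Path

SRC = Path("emr_inbox")  # hypothetical edge-node landing directory
BUNDLE_SIZE = 10_000     # small files per bundle

# Gather the small EMR files, whatever variant format they arrived in.
files = [f for f in sorted(SRC.rglob("*")) if f.is_file()]

for i in range(0, len(files), BUNDLE_SIZE):
    bundle = f"emr_bundle_{i // BUNDLE_SIZE:04d}.tar"
    with tarfile.open(bundle, "w") as tar:
        for f in files[i:i + BUNDLE_SIZE]:
            # Preserve the relative path so the variant/provider
            # can still be identified downstream.
            tar.add(f, arcname=str(f.relative_to(SRC)))
    print(f"wrote {bundle} ({min(BUNDLE_SIZE, len(files) - i)} files)")
```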
Some Thoughts – Enable the Future
• Big Data requires the convergence of data governance, advanced data engineering, data science and business smarts
• Make sure your data can be trusted and people can be held accountable for impact caused by low data quality
• It takes a village to achieve all the tasks required for effective big data strategy & execution
• Get experts that have done it before!
Achieve the impossible…
…everything is impossible until someone does it!
Agile DW & ETL Training in NYC, 2015
Workshops: www.casertaconcepts.com/training
• Sept 21–22 (2 days): Agile Data Warehousing, taught by Lawrence Corr
• Sept 23–24 (2 days): ETL Architecture and Design, taught by Joe Caserta (Big Data module added)
SAVE $300 BY REGISTERING BEFORE JULY 31ST!
Recommended Reading
Thank You
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta

Editor's Notes

  • #12 Last 2 years have been more exciting than the previous 27
  • #15 Reports → correlations → predictions → recommendations
  • #16 We focused our attention on building a single version of the truth. We mainly applied data governance to the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to GOVERN how the data would be used.
  • #22 Volume, Variety, Veracity and Velocity
  • #26 Spark would make this easier and could leverage the same DQ code