This slide deck by Peter Aiken and Micah Dalton covers implementing big data, NoSQL, and Hadoop. It opens with biographies of both presenters, then discusses why it is important to consider the messenger behind big data claims, what big data technologies are technically good at, successful big data approaches, how those approaches help operationally, and definitions and visualizations of the big data landscape.
Implementing Big Data, NoSQL, & Hadoop - Bigger Is (Usually) Better
1. Peter Aiken, Ph.D. & Micah Dalton
Implementing Big Data, NOSQL, & HADOOP
Demystifying Big Data: Bigger is (Usually) Better
Copyright 2017 by Data Blueprint Slide # 1
• DAMA International President 2009-2013
• DAMA International Achievement Award 2001 (with Dr. E. F. "Ted" Codd)
• DAMA International Community Award 2005
Peter Aiken, Ph.D.
• 33+ years in data management
• Repeated international recognition
• Founder, Data Blueprint (datablueprint.com)
• Associate Professor of IS (vcu.edu)
• DAMA International (dama.org)
• 10 books and dozens of articles
• Experienced w/ 500+ data
management practices
• Multi-year immersions:
– US DoD (DISA/Army/Marines/DLA)
– Nokia
– Deutsche Bank
– Wells Fargo
– Walmart
– …
Books pictured: "Monetizing Data Management: Unlocking the Value in Your Organization's Most Important Asset" (Peter Aiken with Juanita Billings, foreword by John Bottega) and "The Case for the Chief Data Officer: Recasting the C-Suite to Leverage Your Most Valuable Asset" (Peter Aiken and Michael Gorman)
2. Micah Dalton
Micah is a senior business leader with twenty years of management experience building and leading teams to deliver results across industries including financial services, the public sector, non-profits, and higher education. Micah's expertise in offering pragmatic business solutions has made him a valuable member of client teams. His skills focus on using data to drive root cause identification, analytics, strategy, financial analysis and reporting, procurement strategy and cost management, and operations analysis and management. Micah helped lead the development of Capital One's Six Sigma program and completed his Black Belt training. He also holds certifications in Organizational Change Management (PROSCI) and Data Management (CDMP-Associate from DAMA). Micah earned his MBA from Duke's Fuqua School of Business, focusing on corporate finance and business strategy. Prior to that, he earned his Bachelor's degree in economics from Mary Washington College. Additionally, Micah was a member of the 2014 class of Leadership Metro Richmond and has been an adjunct professor of Marketing at the University of Mary Washington.
Implementing Big Data, NOSQL, & HADOOP
Demystifying Big Data: Bigger is (Usually) Better
• Why it is important to consider the messenger
– What is being "sold?"
– We are using the wrong vocabulary to discuss this topic
• Technically what are Big Data Technologies good at?
– Computers→ commodity-based computing infrastructure
– Flash memory is currently obeying Moore's Law
– RAM→increased processing
– Parallel-friendly approaches (lots of repeatable actions)
• Successful Big Data Approaches ...
– Innovation
– Reengineering (precise definition)
– Throw away Prototyping
• How does that help operationally?
– Solid support community
– Examples
3. Welcome to the Post-Big Data Era!
Big Data: Expanding on 3 Fronts (Data Volume, Data Velocity, Data Variety) at an Increasing Rate
Big Data (has something to do with Vs - doesn't it?)
• Volume
– Amount of data
• Velocity
– Speed of data in and out
• Variety
– Range of data types and sources
• 2001 Doug Laney
• Variability
– Many options or variable interpretations confound analysis
• 2011 ISRC
• Vitality
–A dynamically changing Big Data environment in which analysis and predictive models
must continually be updated as changes occur to seize opportunities as they arrive
• 2011 CIA
• Virtual
– Scoping the discussion to only include online assets
• 2012 Courtney Lambert
• Value/Veracity
• Stuart Madnick (John Norris Maguire Professor of Information Technology, MIT Sloan School of
Management & Professor of Engineering Systems, MIT School of Engineering)
4. The 13 V’s of Big Data
• Vast Volume of Vigorously Verified, Vexingly Variable, Verbose yet Valuable, Vital, Visualized, high-Velocity and Veracity data that encourages the Vanity of the big data experts
– Original from John Mashey – Silicon Graphics 1998 (with contributed extensions)
• We have no objective
definition of big data!
– Any measurements,
claims of success,
quantifications, etc.
must be viewed
skeptically and with
suspicion!
5. "I shall not today attempt further to define the kinds of material but I know it when I see it ..." (Justice Potter Stewart)
Big Data [ Techniques / Technologies ]
7. Big Data Techniques
• New techniques available to impact the productivity (by an order of magnitude) of any analytical insight cycle that complement, enhance, or replace conventional (existing) analysis methods
• Big data techniques are currently characterized by:
– Continuous, instantaneously available data sources
– Non-von Neumann processing (defined later in the presentation)
– Capabilities approaching or past human comprehension
– Architecturally enhanceable identity/security capabilities
– Other tradeoff-focused data processing
• So a good question becomes "where in our existing architecture
can we most effectively apply Big Data Techniques?"
The Big Data Landscape
Copyright Dave Feinleib, bigdatalandscape.com
8. The Big Data Landscape 2.0
The Big Data Landscape 3.0
Copyright Dave Feinleib, bigdatalandscape.com
9. Internet of Things Landscape 2016
http://blogs.cisco.com/sp/from-internet-of-things-to-web-of-things/
12. Big Data Technologies, by themselves, are a One-Legged Stool
Governance is the major means of preventing over-reliance on one-legged stools!
13. Cost per computing cycle declining
14. 10X+++ rapid access
"There’s now a blurring between the storage world and the memory world"
• Faster processors outstripped not only the hard disk, but main memory
– Hard disk too slow
– Memory too small
• Flash drives remove both bottlenecks
– Combined, Apple and Yahoo have spent more than $500 million to date
• Make it look like traditional storage or more system memory
– Minimum 10x improvements
– Facebook's Dragonstone server has 3.2 TB of flash memory
• Bottom line - new capabilities!
15. Non-von Neumann Processing/Efficiencies
• von Neumann bottleneck (computer science)
– "An inefficiency inherent in the design of any von Neumann machine that arises from the fact that most computer time is spent in moving information between storage and the central processing unit rather than operating on it"
[http://encyclopedia2.thefreedictionary.com/von+Neumann+bottleneck]
• Michael Stonebraker
– Ingres (Berkeley/MIT)
– Modern database processing is approximately 4% efficient
• Many big data architectures are attempts to address this, but:
– Zero-sum game
– Trade characteristics against each other
• Reliability
• Predictability
– Google/MapReduce/Bigtable
– Amazon/Dynamo
– Netflix/Chaos Monkey
– Hadoop
– McDipper
• Big data techniques exploit non-von Neumann processing
What is NoSQL?
• Commonly interpreted as both "No SQL" and "Not Only SQL"
• Broad class of database management technologies that provide a mechanism for storage and retrieval of data that doesn't follow traditional relational database methodology
• Motivations
– Simplicity of design
– Horizontal scaling
– Finer control over availability of the data
• The data structures used by NoSQL databases differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases
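The schema-free, key-based access pattern described above can be sketched with a minimal in-memory document store. This is an illustrative toy, not any real product's API; the class and method names are assumptions made for the example.

```python
# A minimal in-memory "document store" sketching the access pattern many
# NoSQL systems share: lookup by key, no fixed schema, no joins.
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        # Documents need not share a schema -- unlike relational rows.
        self._docs[key] = doc

    def get(self, key, default=None):
        # Constant-time retrieval by key; no query planner involved.
        return self._docs.get(key, default)

store = DocumentStore()
store.put("user:1", {"name": "Ada", "tags": ["analytics"]})
store.put("user:2", {"name": "Alan", "email": "alan@example.com"})  # different fields
print(store.get("user:1")["name"])  # Ada
```

Note the tradeoff the slide describes: key lookups are trivial and scale horizontally, but anything relational databases do well (joins, ad hoc filters across documents) must be done by the application.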
16. What is Hadoop?
• A data storage and processing system that runs on clusters of commodity servers
• Able to store any kind of data in its native format
• Performs a wide variety of analyses and transformations
• Stores terabytes, and even petabytes, of data inexpensively
• Handles hardware and system failures automatically, without losing data or interrupting data analyses
• Critical components of Hadoop:
– HDFS – The Hadoop Distributed File System is the storage system for a Hadoop cluster, responsible for distribution of data across the servers
– MapReduce – The inner workings of Hadoop that allow for distributed and parallel analytical job execution
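The MapReduce shape mentioned above can be sketched in a few lines: independent map tasks emit partial results with no shared state (which is what Hadoop distributes across a cluster), and an order-insensitive reduce merges them. The corpus and chunking here are illustrative assumptions.

```python
from collections import Counter
from functools import reduce

# Toy MapReduce word count -- the same map/reduce shape Hadoop runs at scale.
def map_chunk(chunk):
    # Map: emit a partial word count per independent chunk (no shared state,
    # so chunks can be processed in parallel on different servers).
    return Counter(chunk.split())

def reduce_counts(a, b):
    # Reduce: merge partial counts; merge order does not matter.
    a.update(b)
    return a

chunks = ["big data big ideas", "data moves to code", "code moves to data"]
partials = [map_chunk(c) for c in chunks]   # the parallelizable step
totals = reduce(reduce_counts, partials, Counter())
print(totals["data"])  # 3
```

The key property is that each map task touches only its own chunk, so adding servers adds throughput, matching the "add servers incrementally" scaling argument on the next slide.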
One of Data Blueprint's Big Data Clusters
17. Why NoSQL? Why Hadoop?
• Large number of users (read: the internet)
• Rapid app development and deployment
• Large number of mission critical writes (sensors/etc)
• Small, continuous reads and writes, especially where
“Consistency” is less important (social networks)
• Hadoop solves the hard scaling problems caused by large
amounts of complex data.
• As the amount of data in a cluster grows,
new servers can be added to a Hadoop
cluster incrementally and inexpensively
to store and analyze it.
Hadoop Use Cases in the Real World
• Risk Modeling
• Customer Churn Analysis
• Recommendation Engine
• Ad Targeting
• Point of Sale Transaction Analysis
• Social Sentiment on Social Media
• Analyzing network data to predict failure
• Threat analysis
• Trade Surveillance
http://blogs.informatica.com/perspectives/uk/2011/08/09/hadoop-enriches-data-science-part-2-of-hadoop-series/
Potential Tradeoffs:
CAP theorem: consistency, availability and partition (fault) tolerance
• RDBMS – ACID: Atomicity, Consistency, Isolation, Durability
• NoSQL – BASE: Basic Availability, Soft-state, Eventual consistency
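The BASE side of the tradeoff can be illustrated with a toy replication sketch: a write is accepted by a single replica (keeping the system available), and a background anti-entropy pass later converges the others (eventual consistency). Replica count and the sync policy are assumptions made for the example.

```python
import random

# Three replicas of a key-value store; none is authoritative.
replicas = [{}, {}, {}]

def write(key, value):
    # Accept the write on whichever replica is reachable (availability) ...
    random.choice(replicas)[key] = value

def anti_entropy():
    # ... and converge all replicas later (eventual consistency).
    merged = {}
    for r in replicas:
        merged.update(r)
    for r in replicas:
        r.update(merged)

write("cart", ["book"])
consistent_before = all(r.get("cart") == ["book"] for r in replicas)
anti_entropy()
consistent_after = all(r.get("cart") == ["book"] for r in replicas)
print(consistent_before, consistent_after)  # False True
```

Between the write and the sync, different replicas answer differently (the "soft state" in BASE); an ACID system would instead block or reject the write until all copies agreed.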
Pacman
• Decomposition
• Reassembly
– not optional!
Sandwich use case
• Landing Zone (less expensive)
– Especially useful in cases where data is highly disposable
• Existing technologies
– Contents sandwiched between, and complemented by, landing zone and archival capabilities
• Archiving/Offloading (less need for structure)
– "Cold" transactional and analytic data
Adapted from Nancy Kopp:
http://ibmdatamag.com/2013/08/relishing-the-big-data-burger/
(Diagram: Landing Zone and Archiving/Offloading sandwiching Existing Data Architectural Processing)
21. See Like a Snake
22. Pit Organ
They can switch back and forth between those two systems, or use both simultaneously, giving them a leg up, so to speak, when it comes to targeting a warm object.
23. Analytics Insight Cycle
• Things are happening
– Sensemaking techniques address "what" is happening?
• Patterns/objects, hypotheses emerge
– What can be observed?
• Operationalizing
– The dots can be repeatedly connected
– "Big Data" contributions are shown in orange
• Margaret Boden's computational creativity
– Exploratory
– Combinational
– Transformational
(Diagram: an existing knowledge base feeds "sensemaking" techniques; data Volume, Velocity, and Variety drive pattern/object emergence through an analytical bottleneck, yielding potential/actual and combined/informed insights, with discernment and feedback loops back to exploitable insight)
J. C. R. Licklider's Man-Computer Symbiosis
Humans Generally Better:
• Sense low-level stimuli
• Detect stimuli in noisy background
• Recognize constant patterns in varying situations
• Sense unusual and unexpected events
• Remember principles and strategies
• Retrieve pertinent details without a priori connection
• Draw upon experience and adapt decisions to the situation
• Select alternatives if original approach fails
• Reason inductively; generalize from observations
• Act in unanticipated emergencies and novel situations
• Apply principles to solve varied problems
• Make subjective evaluations
• Develop new solutions
• Concentrate on important tasks when overload occurs
• Adapt physical response to changes in situation
Machines Generally Better:
• Sense stimuli outside human's range
• Count or measure physical quantities
• Store quantities of coded information accurately
• Monitor prespecified events, especially infrequent ones
• Make rapid and consistent responses to input signals
• Recall quantities of detailed information accurately
• Retrieve pertinent details without a priori connection
• Process quantitative data in prespecified ways
• Perform repetitive preprogrammed actions reliably
• Exert great, highly controlled physical force
• Perform several activities simultaneously
• Maintain operations under heavy operational load
• Maintain performance over extended periods of time
The best approaches combine manual and automated methods!
Gartner Recommendations (Gartner 2012)
• Impact: Some of the new analytics made possible by big data have no precedent, so innovative thinking will be required to achieve value.
  Recommendation: Treat big data projects as innovation projects that will require change management efforts. The business will take time to trust new data sources and new analytics.
• Impact: Creative thinking can unearth valuable information sources already inside the enterprise that are underused.
  Recommendation: Work with the business to conduct an inventory of internal data sources outside of IT's direct control, and consider augmenting existing data that is IT-'controlled.' With an innovation mindset, explore the potential insight that can be gained from each of these sources.
• Impact: Big data technologies often create the ability to analyze faster, but getting value from faster analytics requires business changes.
  Recommendation: Ensure that big data projects that improve analytical speed always include a process redesign effort that aims at getting maximum benefit from that speed.
25. Innovation
• Innovation is the development of new customer value through solutions that meet new needs, inarticulate needs, or old customer and market needs in new ways. This is accomplished through different or more effective products, processes, services, technologies, or ideas that are readily available to markets, governments, and society.
• Innovation differs from invention in that innovation
refers to the use of a better and, as a result, novel
idea or method, whereas invention refers more
directly to the creation of the idea or method itself.
• Innovation differs from improvement in that
innovation refers to the notion of doing something
different (Lat. innovare: "to change") rather than
doing the same thing better.
Data must be incorporated into the innovation-navigation process
Implementing Big Data, NOSQL, & HADOOP
Demystifying Big Data: Bigger is (Usually) Better
• Why it is important to consider the messenger
– What is being "sold?"
– We are using the wrong vocabulary to discuss this topic
• Technically what are Big Data Technologies good at?
– Computers→ commodity-based computing infrastructure
– Flash memory is currently obeying Moore's Law
– RAM→increased processing
– Parallel-friendly approaches (lots of repeatable actions)
• Successful Big Data Approaches ...
– Innovation
– Reengineering (precise definition)
– Throw away Prototyping
• How does that help operationally?
– Solid support community
– Examples
• Why it is important to consider the messenger
– What is being "sold?"
– We are using the wrong vocabulary to discuss this topic
• Technically what are Big Data Technologies good at?
– Computers→ commodity-based computing infrastructure
– Flash memory is currently obeying Moore's Law
– RAM→increased processing
– Parallel-friendly approaches (lots of repeatable actions)
• Successful Big Data Approaches ...
– Innovation
– Reengineering (precise definition)
– Throw away Prototyping
• How does that help operationally?
– Solid support community
– Examples
Reengineering (Objective Definition)
• How can you state that you have improved any system
• if you don't understand the existing (legacy) system's strengths and weaknesses?
• You can't use these to inform the new system
• To reengineer
– You must first reverse engineer, and then
– Use that information to architect the new system
(Diagram: Legacy System Analysis (break down & compare) → New System Requirements → New System, yielding $$$ Value)
Potential Tradeoffs:
CAP theorem: consistency, availability and partition (fault) tolerance
• RDBMS – ACID: Atomicity, Consistency, Isolation, Durability
• NoSQL – BASE: Basic Availability, Soft-state, Eventual consistency
• Small datasets can be both consistent & available
29. 'Throw-away' prototyping
• With 'throw-away' prototyping, a small part of the system is developed and then given to the end user to try out and evaluate. The user provides feedback which can quickly be incorporated into the development of the main system. The prototype is then discarded or thrown away.
30. David Brooks, New York Times
• Data analysis struggles with the social
– Your brain is excellent at social cognition - people can
• Mirror each other’s emotional states
• Detect uncooperative behavior
• Assign value to things through emotion
– Data analysis measures the quantity of social
interactions but not the quality
• Map interactions with co-workers you see during work days
• Can't capture devotion to childhood friends seen annually
– When making (personal) decisions about social
relationships, it’s foolish to swap the amazing machine
in your skull for the crude machine on your desk
• Data struggles with context
– Decisions are embedded in sequences and contexts
– Brains think in stories - weaving together multiple
causes and multiple contexts
– Data analysis is pretty bad at
• Narratives / Emergent thinking / Explaining
• Data creates bigger haystacks
– More data leads to more statistically significant
correlations
– Most are spurious and deceive us
– Falsity grows exponentially with the amount of data we collect
• Big data has trouble with big problems
– For example: the economic stimulus debate
– No one has been persuaded by data to switch sides
• Data favors memes over masterpieces
– Detect when large numbers of people take an instant
liking to some cultural product
– Products are hated initially because they are unfamiliar
• Data obscures values
– Data is never raw; it’s always structured according to
somebody’s predispositions and values
Some Big Data Limitations
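The "bigger haystacks" point above can be demonstrated with a small simulation: generate many mutually independent random variables and count how many pairs correlate strongly by chance alone. Sample sizes, the correlation threshold, and the seed are illustrative assumptions.

```python
import random
import statistics

# With enough unrelated variables, some pairs will look "significant"
# purely by chance -- spurious correlations in pure noise.
random.seed(42)

def corr(x, y):
    # Pearson correlation coefficient, computed from first principles.
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

n_vars, n_obs = 40, 20
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
spurious = sum(
    1
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
    if abs(corr(data[i], data[j])) > 0.5
)
print(spurious)  # typically several "strong" pairs, despite pure noise
```

The number of variable pairs grows quadratically with the number of variables, so collecting more data fields makes chance correlations more numerous, not less, which is exactly the haystack problem the slide describes.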
Maslow's Hierarchy of Needs
31. You can accomplish Advanced Data Practices without becoming proficient in the Foundational Data Practices; however, this will:
• Take longer
• Cost more
• Deliver less
• Present greater risk
(with thanks to Tom DeMarco)
Data Management Practices Hierarchy
• Advanced Data Practices (Technologies): MDM, Mining, Big Data, Analytics, Warehousing, SOA
• Foundational Data Practices (Capabilities): Data Platform/Architecture, Data Governance, Data Quality, Data Operations, Data Management Strategy
Implementing Big Data, NOSQL, & HADOOP
Demystifying Big Data: Bigger is (Usually) Better
• Why it is important to consider the messenger
– What is being "sold?"
– We are using the wrong vocabulary to discuss this topic
• Technically what are Big Data Technologies good at?
– Computers → commodity-based computing infrastructure
– Flash memory is currently obeying Moore's Law
– RAM → increased processing
– Parallel-friendly approaches (lots of repeatable actions)
• Successful Big Data Approaches ...
– Innovation
– Reengineering (precise definition)
– Throwaway Prototyping
• How does that help operationally?
– Solid support community
– Examples
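The "parallel-friendly approaches" point above is the heart of MapReduce, the programming model behind Hadoop: lots of small, repeatable, independent actions. A minimal single-process sketch of the classic word count (function names are illustrative; a real Hadoop job distributes the map and reduce steps across many nodes):

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    # Map: turn one worker's chunk of lines into (word, 1) pairs.
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle(pairs):
    # Shuffle: group pairs by key so each reducer sees one word's values.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: many small, repeatable, independent summations.
    return {key: sum(values) for key, values in groups.items()}

chunks = [["big data big deal"], ["big data tools"]]  # two "workers"
pairs = chain.from_iterable(map_phase(c) for c in chunks)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'deal': 1, 'tools': 1}
```

Because each map and reduce call touches only its own slice of the data, the framework can run them on commodity machines in parallel, which is exactly what these platforms are good at.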
32. Copyright 2013 by Data Blueprint
Social Sentiment Analysis
• One of the burgeoning areas for the use of
Big Data / Hadoop platforms
• Allows for the landing of multiple sources of
unstructured data (Twitter, Facebook,
LinkedIn, etc.)
• Data that can then be analyzed with
algorithms looking for keywords that indicate
positive or negative feedback
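A minimal sketch of the keyword approach just described (the word lists and scoring rule are illustrative assumptions; production systems use trained models or curated lexicons):

```python
# Hypothetical keyword lists -- real systems use far larger lexicons.
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "angry"}

def score(post):
    """Return +1 (positive), -1 (negative), or 0 (neutral)
    based on keyword hits in one social media post."""
    words = {w.strip(".,!?").lower() for w in post.split()}
    balance = len(words & POSITIVE) - len(words & NEGATIVE)
    return (balance > 0) - (balance < 0)

posts = ["Love the new release, great work!", "Awful update. I hate it."]
print([score(p) for p in posts])  # [1, -1]
```

Each post is scored independently, so the scoring step itself is exactly the kind of parallel-friendly, repeatable action a Hadoop platform handles well.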
Operational Use
• Utilized real-time pricing data from multiple sources to dynamically
update the pricing for books in the Amazon Marketplace
• Ingested data from multiple sources, looking for real-time changes
in price
• Applied a predictive model to determine the best price point and set
the price of the books on the marketplace
• Increased the conversion rate, but created a race-to-the-bottom
situation if not monitored
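The repricing logic above can be sketched with a margin floor as the guard against the race-to-the-bottom problem (all names and numbers are illustrative assumptions, not Amazon's or any seller's actual algorithm):

```python
def reprice(competitor_prices, cost, undercut=0.01, floor_margin=0.10):
    """Undercut the lowest competitor price, but never drop below a
    minimum margin over cost -- the monitoring guard against a
    race to the bottom."""
    candidate = min(competitor_prices) - undercut
    floor = cost * (1 + floor_margin)
    return round(max(candidate, floor), 2)

print(reprice([12.99, 11.50, 13.25], cost=8.00))  # 11.49
print(reprice([8.20, 8.05], cost=8.00))           # 8.8 (floor kicks in)
```

Without the floor, two sellers running the same undercut rule against each other would drive the price toward cost, which is the unmonitored failure mode the slide warns about.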
33.
Healthcare Example: Patient Data
• Clinical data:
– Diagnosis/prognosis/treatment
– Genetic data
• Patient demographic data
• Insurance data:
– Insurance provider
– Claims data
• Prescriptions & pharmacy information
• Physical fitness data
– Activity tracking through
smartphone apps & social media
• Health history
• Medical research data
http://www.forbes.com/sites/xerox/2013/09/27/big-data-boosts-customer-loyalty-no-really/
Retail Example: Loyalty Programs & Big Data
• Companies need to understand current wants and needs AND
predict future tendencies
• Customer → Repeat Customer → Brand Advocate
• Customer loyalty programs & retention strategies
– Track what is being purchased and how often
– Coupons based on purchasing history
– Targeted communications, campaigns & special offers
– Social media for additional interactions
– Personalize consumer interactions
• Customer purchase history influences
product placements
– Retailers rapidly respond to consumer demands
– Product placements, planogram optimization, etc.
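The "coupons based on purchasing history" idea above can be sketched as a simple frequency rule (names and thresholds are illustrative assumptions; real loyalty programs use far richer models):

```python
from collections import Counter

def coupon_targets(purchases, category, min_buys=3):
    """Pick customers who bought from a category often enough
    to merit a targeted coupon for it."""
    counts = Counter(cust for cust, cat in purchases if cat == category)
    return sorted(c for c, n in counts.items() if n >= min_buys)

purchases = [("ann", "coffee"), ("ann", "coffee"), ("ann", "coffee"),
             ("bob", "coffee"), ("ann", "tea"), ("bob", "coffee")]
print(coupon_targets(purchases, "coffee"))  # ['ann']
```

The same purchase-frequency counts feed the product-placement decisions mentioned above: categories that sell often to loyal customers earn better shelf positions.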
34.
References
• The Human Face of Big Data, Rick Smolan & Jennifer Erwitt, First Edition (November 20, 2012)
• McKinsey: Big Data: The Next Frontier for Innovation, Competition and Productivity (http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation?p=1)
• The Washington Post: Five Myths About Big Data (http://articles.washingtonpost.com/2013-08-16/opinions/41416362_1_big-data-data-crunching-marketing-analytics)
• Gartner: Gartner's 2013 Hype Cycle for Emerging Technologies Maps Out Evolving Relationship Between Humans and Machines (http://www.gartner.com/newsroom/id/2575515)
• The New York Times | Opinion Pages: What Data Can't Do (http://www.nytimes.com/2013/02/19/opinion/brooks-what-data-cant-do.html?_r=1&)
• CIO.com: Five Steps for How to Better Manage Your Data (http://www.cio.com.au/article/429681/five_steps_how_better_manage_your_data/)
• Business Insider: Enterprises Aren't Spending Wildly on 'Big Data' But Don't Know If It's Worth It Yet (http://www.businessinsider.com/enterprise-big-data-spending-2012-11#ixzz2cdT8shhe)
• Inc.com: Big Data, Big Money: IT Industry to Increase Spending (http://www.inc.com/kathleen-kim/big-data-spending-to-increase-for-it-industry.html)
It’s your turn!
Use the chat feature or Twitter (#dataed) to submit
your questions to everyone now
Questions?
35. 10124 W. Broad Street, Suite C
Glen Allen, Virginia 23060
804.521.4056