Gilbane Boston 2011 big data

Get Ready for Big Data
Wednesday November 30, 2011
2:40 – 4:00

Peter O'Kelly
Principal Analyst, O'Kelly Associates
Hadley Reynolds
Managing Director, Next Era Research
Kathleen Reidy
Senior Analyst, 451 Research

Agenda
• Big data in context
• Big structured data
• Big unstructured data
• Big opportunities and risks
• Q&A

2

Big Data in Context
• What is “big data”?
– Unhelpfully, both “big data” and “NoSQL,” generally
considered a key part of the big data wave, are
defined more in terms of what they’re not than
what they are
– A typical big data definition (Wikipedia):
• “[…] datasets that grow so large that they become
awkward to work with using on-hand database
management tools”

3

Big Data in Context
• With thanks to the Business SOA blog:
– “[…] describe Big Data in the same way that
The Hitchhiker's Guide to the Galaxy described space:
– ‘Space,’ it says, ‘is big. Really big. You just won't believe how
vastly, hugely, mindbogglingly big it is. I mean, you may think it's
a long way down the road to the chemist's, but that's just
peanuts to space, listen...’”

4

Big Data in Context
• Why is big data a big deal now?
– Commodity hardware and the Internet
• Capability and price/performance curves that continue to defy all
economic “laws”
• Also facilitating compelling cloud services
– Maturation and uptake of open source software, e.g., Hadoop
• Powerful and often no- or low-cost
– IT market
• Enthusiasm for “NoSQL” systems
• Frustration with incumbent information management vendors
– Useful new data sources/resources, e.g., social network activity
graphs, the “Internet of things,” sensor networks…
– Competitive and compliance imperatives

5

Big Data in Context
• A big data reality check
– “Mindbogglingly”-scale information management is not new
• Consider, e.g., VLDB, multi-billion document repositories, and the
World Wide Web…
– What is new and compelling
• The combination of market dynamics producing new capability and
price/performance curves
• Cloud
– No deep capital investment required to get started
– Cloud-based information resources
• Some innovative marketing, suggesting
– Self-proclaimed next-generation big data systems are magical and
revolutionary
– Deployed systems are obsolete and wasteful

6

A Big-Picture Framework
• A digital information item dichotomy
– Resources (~unstructured information)
• Digital artifacts optimized to convey stories
– Organized in terms of narrative, hierarchy, and sequence
• Examples: books, magazines, documents (e.g., PDF,
Word), Web pages, XBRL documents, video, hypertext…
– Relations (~structured information)
• Application-independent descriptions of real-world
things and relationships
• Examples: business domain databases, e.g., customer,
sales, HR…

7


Resource Relation

8


• Conceptual
– Resources: resources and links
– Relations: entities, attributes, relationships, and identifiers
• Logical
– Resources: model: hypertext; language: XQuery (ideally)
– Relations: model: extended relational; language: SQL
• Physical (resources and relations)
– Indexing (e.g., scalar data types, XML, full-text), locking and
isolation levels, federation, replication, in-memory databases,
columnar storage, table spaces, caching, and more
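
A minimal sketch of the resource/relation dichotomy, assuming a made-up customer example: the same fact is held once as a relation queried with SQL and once as a resource (an XML document) queried with a path expression. Python's standard library has no XQuery engine, so the limited XPath subset in xml.etree.ElementTree stands in for it here; the table, element, and function names are illustrative only.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_relation():
    """Relation: application-independent rows, queried with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
        [(1, "Acme", "Boston"), (2, "Globex", "Chicago")],
    )
    rows = conn.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Boston",)
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


# Resource: a narrative document organized by hierarchy and sequence.
REPORT_XML = """
<report>
  <title>Customer visit notes</title>
  <section city="Boston"><customer>Acme</customer><p>Met with the CIO.</p></section>
  <section city="Chicago"><customer>Globex</customer><p>Follow-up pending.</p></section>
</report>
"""


def query_resource():
    """Resource: navigate the document tree with a path expression."""
    root = ET.fromstring(REPORT_XML.strip())
    matches = root.findall("./section[@city='Boston']/customer")
    return [element.text for element in matches]


if __name__ == "__main__":
    print(query_relation())
    print(query_resource())
```

Both functions answer the same question, but the SQL version depends only on the declared columns, while the path expression depends on how the document happens to be organized.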

9

Agenda
• Q&A

10

Big Structured Data
• NoSQL
• Hadoop
• RDBMS reconsidered
• Back to the bigger picture

11

NoSQL
• No clear consensus on what “NoSQL” means
– Started with what it’s against, not what it’s about
• And often finds a receptive audience due to frustration
with RDBMS business-as-usual
– The “NoSQL” meme is a moving target
• Initially implied “Just say ‘no’ to SQL”
• Later quietly redefined as “Not Only SQL”
• What may be next: “New Opportunities for SQL”
– I.e., some developers may reconsider the value of SQL and
RDBMSs, after hitting NoSQL limitations

12

A NoSQL Taxonomy
• From the NoSQL Wikipedia article:

13

NoSQL Perspectives
• The “NoSQL” meme confusingly conflates
– Document database requirements
• Best served by XML DBMS (XDBMS)
– Physical model decisions on which only DBAs and systems
architects should focus
• And which are more complementary than competitive with
RDBMS/XDBMS
– Object databases, which have floundered for decades
• But with which some application developers are nonetheless
enamored, for minimized “impedance mismatch,” despite
significant information management compromises
– Semantic models
• Also more complementary than competitive with RDBMS/XDBMS

14

Hadoop
• Hadoop is often considered central to big data
– Inspired by Google’s published MapReduce design, Apache
Hadoop is an open source framework for distributed
processing on clusters of commodity hardware
• Commercial application domains include (from Wikipedia)
– Log and/or clickstream analysis of various kinds
– Marketing analytics
– Machine learning and/or sophisticated data mining
– Image processing
– Processing of XML messages
– Web crawling and/or text processing
– General archiving, including of relational/tabular data, e.g. for
compliance

15

Hadoop
• Hadoop is popular and rapidly evolving
– Most leading information management vendors,
including Microsoft, have embraced Hadoop
– There is now a Hadoop ecosystem

16

RDBMS Reconsidered
• RDBMS incumbents appear to be under siege, with
– IT frustration with RDBMS business-as-usual
• Counterproductive RDBMS vendor policies and attitudes
• DBA modus operandi often seen as excessively conservative
– Conventional wisdom about RDBMS limitations for, e.g.,
• “Web scale”
• “Agility”
• The application/database “impedance mismatch”
– The advent of open source and/or specialized DBMSs
• E.g., MySQL is the M in the “LAMP stack”
• “The end of the one-size-fits-all DBMS era”

17

RDBMS Reconsidered
• An RDBMS reality check
– Leading RDBMS products and open source initiatives are
very powerful and flexible
• And will continue to evolve, e.g., with the mainstream deployment
of massive-memory servers and solid state disk (SSD) storage
– And they continue to expand
• E.g., in-database processing, with, for example, analytics engines
running within DBMS kernels
– But the RDBMS incumbents nonetheless face
unprecedented challenges
• Which sometimes resonate with frustrated architects and
developers because of negative experiences that have more to do
with how RDBMSs were used than with what RDBMSs can
effectively address
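
The in-database processing point can be illustrated with a small sketch, assuming a hypothetical sales table in SQLite. One function pulls every row into the application and aggregates there; the other pushes the aggregation into the database engine. Table, column, and function names are illustrative only.

```python
import sqlite3


def make_sales_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (?, ?)",
        [("east", 100.0), ("east", 50.0), ("west", 70.0)],
    )
    return conn


def totals_in_application(conn):
    """Ship every row to the client, then aggregate in application code."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_in_database(conn):
    """Let the engine aggregate; only one row per region leaves the database."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    return dict(cursor)


if __name__ == "__main__":
    conn = make_sales_db()
    print(totals_in_application(conn))
    print(totals_in_database(conn))
```

The results are identical; the difference is how much data has to move, which matters more as tables grow.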

18

RDBMS in the Big-Picture Framework

• Conceptual
– Resources: resources and links
– Relations: entities, attributes, relationships, and identifiers
• Logical
– Resources: model: hypertext; language: XQuery
– Relations: model: extended relational; language: SQL
• Physical (resources and relations)
– Indexing (e.g., scalar data types, XML, full-text), locking and
isolation levels, federation, replication, in-memory databases,
columnar storage, table spaces, caching, and more

19

RDBMS Reconsidered
• A Forrester big data reality check (from “Stay
Alert To Database Technology Innovation,”
11/19/2010):
– “For 90% of BI use cases, which are often less than
50 terabytes in size, relational databases still are
good enough” (p. 4)
– “Traditional relational databases are still good
enough for the majority of transactional use
cases” (p. 5)

20

Back to the Bigger Picture
• Compared with traditional enterprise data
management, big data is
– Essentially a collection of specialized physical
models for very large, analysis-oriented data
management
– Expanding to encompass resources as well as
relations
– More about the potential for displacing expensive
and closed/proprietary distributed processing
alternatives than displacing RDBMS or XDBMS

21

Structured Big Data: Recap
• Substantive, sustainable, and synergistic
– RDBMS
– XDBMS
– Hadoop
– The cloud as an information management
platform
• Vaguely defined, transitory, and over-hyped
– NoSQL

22

Agenda
• Q&A

23

Big Unstructured Data
• Finding Facts about Data – IDC/EMC
• Patterns for Unstructured Big Data
• How-to issues – who will know?

24

http://www.emc.com/leadership/programs/digital-universe.htm

25

Facebook:
800M users
500M visitors/day
$100B potential value @ IPO

34

http://inmaps.linkedinlabs.com/

35

Unstructured Big Data Patterns
• Search
• Social
• Mobile
• Online Activities/Digital Marketing
• Inquiry/Detection – Connecting Dots
• Question Answering

36

Mobile Adds:

Location data points
Voice searches
Siri questions
App history profile
Browse history profile
Search history profile
Past purchase profile
Camera-generated outputs/inputs
Coupon delivery & merchandising
Friends' locations
Social search
Local ad-match algo opportunities

37

Online Activities/Digital Marketing

39

• Inquiry/Detection – Connecting Dots
– Intelligence
– Law Enforcement
– Fraud Detection (Government, Financial, Health, …)
– eDiscovery
40

Social Media Monitoring

41

Question Answering

4/28/2011 42

Question Answering Beyond Jeopardy

43

Twitter Analytics Questions
• What can we tell about a user from their tweets?
– from the tweets of those they follow?
– from the tweets of their followers?
– from the ratio of followers/following?
• What graph structures lead to successful networks?
• User reputation?
• Sentiment analysis?
• What features get a tweet retweeted?
– How deep is the retweet tree?
• Long term duplicate detection
• Machine learning
• Language detection
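
Two of the questions above lend themselves to a short sketch: the followers/following ratio and the depth of a retweet tree. The input shapes are assumptions for illustration (a list of (tweet_id, retweeted_id) pairs rather than any real Twitter API response), as are the function names.

```python
from collections import defaultdict, deque


def follower_ratio(followers, following):
    """Followers divided by following; None when the account follows no one."""
    if following == 0:
        return None
    return followers / following


def retweet_tree_depth(original_id, retweets):
    """Longest chain of retweets below the original tweet.

    retweets is an iterable of (tweet_id, retweeted_id) pairs.
    """
    children = defaultdict(list)
    for tweet_id, parent_id in retweets:
        children[parent_id].append(tweet_id)

    depth = 0
    seen = {original_id}
    queue = deque([(original_id, 0)])
    while queue:  # breadth-first walk from the original tweet
        node, level = queue.popleft()
        depth = max(depth, level)
        for child in children.get(node, ()):
            if child not in seen:  # guard against malformed, cyclic input
                seen.add(child)
                queue.append((child, level + 1))
    return depth


if __name__ == "__main__":
    edges = [("t1", "t0"), ("t2", "t0"), ("t3", "t1"), ("t4", "t3")]
    print(retweet_tree_depth("t0", edges))
    print(follower_ratio(200, 100))
```

At Twitter scale the edge list would not fit in one process, which is where a distributed framework comes in; the logic per tree stays the same.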
44

46
http://www.mckinsey.com/en/Features/Big_Data.aspx

Agenda
• Q&A

47

Big Data Opportunities
• Improved visibility and insights
– Can explore previously impractical questions
• Real-time analytics
– Less dependence on “dead data”
• Blur the boundaries between structured and
unstructured information
– Unified views of resources and relations
• Consolidation
– Reduce the number of moving parts in your infrastructure
• Along with related licensing and maintenance expenses
• Compliance – capture and maintain data & records
previously beyond the firm's capabilities

48

Big Data Risks
• The potential for an ever-expanding set of information silos
– Critical to relentlessly focus on minimized redundancy and
optimized integration
• GIGO (garbage in, garbage out) at super-scale
– Dramatic improvements in capabilities and price/performance
provide new opportunities for self-inflicted damage, for
organizations that don’t model or query effectively
• Cognitive overreach
– The potential for information workers to create nonsensical
queries based on poorly-designed and/or misunderstood
information models
• Skills gaps create competitive disadvantages

49

Q&A

Peter O'Kelly - peter@okellyassociates.com
Kathleen Reidy - kathleen.reidy@451Research.com
Hadley Reynolds - hadley.reynolds@nexteraresearch.com

50

Database market landscape
[Figure: vendor landscape map. Top-level labels: Relational, Non-relational, Analytic, Operational. Labeled groupings: NoSQL (Key value, Document, Big tables, Graph), NewSQL, -as-a-Service, and Data Grid/Cache. The assignment of most products to groupings is not recoverable from the extracted text.]
Products named in the figure: Mapr, Infobright, Netezza, ParAccel, SAP Sybase IQ, Piccolo, Hadoop, Teradata, EMC, IBM InfoSphere, Dryad, Brisk, Greenplum, Hadapt, Aster Data, Calpont, VectorWise, HP Vertica, Progress, Oracle, IBM DB2, SQL Server, JustOne, InterSystems, MarkLogic, MySQL, Ingres, PostgreSQL, Objectivity, Lotus Notes, McObject, SAP Sybase ASE, EnterpriseDB, Versant, CouchDB, HandlerSocket, Akiban, MongoDB, MySQL Cluster, Amazon RDS, Couchbase, RavenDB, Cloudant, App Engine, SQL Azure, Clustrix, Riak, Datastore, Database.com, Redis, Drizzle, Xeround, FathomDB, GenieDB, Membrain, SimpleDB, ScalArc, Cassandra, Voldemort, Hypertable, Schooner MySQL, CodeFutures, InfiniteGraph, Tokutek, ScaleBase, NimbusDB, BerkeleyDB, HBase, Neo4J, Continuent, GraphDB, Translattice, VoltDB
Data Grid/Cache: Terracotta, GigaSpaces, Oracle Coherence, Memcached, IBM eXtreme Scale, GridGain, ScaleOut, VMware GemFire, InfiniSpan, CloudTran

Big Data Complexity Continuum
[Figure: applications plotted by time horizon (horizontal axis: Historic, Current (Monitor), Future (Predict)) against number and complexity of technologies (vertical axis). Labels that survive extraction include speech to text, data mining, pattern detection, log analysis, eCommerce, web search, intelligent machines, brand monitoring, sentiment extraction, voice of customer, relationship detection, reputation management, ad targeting, retargeting, influence networks, fraud detection, medical diagnostics, trend analytics, predictions, climate modeling and prediction, and gov’t intelligence applications. Source label: IDC 2005.]

52
