It is an exciting and interesting time to be involved in data. More change of influence has occurred in the database management in the last 18 months than has occurred in the last 18 years. New technologies such as NoSQL & Hadoop and radical redesigns of existing technologies, like NewSQL , will change dramatically how we manage data moving forward.
These technologies bring with them possibilities both in terms of the scale of data retained but also in how this data can be utilized as an information asset. The ability to leverage Big Data to drive deep insights will become a key competitive advantage for many organisations in the future.
Join Tony Bain as he takes us through both the high level drivers for the changes in technology, how these are relevant to the enterprise and an overview of the possibilities a Big Data strategy can start to unlock.
3. RockSolid SQL
About this Webinar
Big Data
• What is it?
• Why does it matter?
• How it impacts the Enterprise?
Data Science
Tech Level 100
NoSQL
• Key/Value Stores
• Document Stores
• Graph Databases
NewSQL
• In Memory Transaction Processing
• MPPP
Hadoop
Summary
5. Big Data Webinar
What is Big Data?
“Big data[1] are datasets that grow so large that they become awkward to
work with using on-hand database management tools. Difficulties include
capture, storage,[2] search, sharing, analytics,[3] and visualizing. This trend
continues because of the benefits of working with larger and larger
datasets allowing analysts to "spot business trends, prevent diseases,
combat crime."[4] Though a moving target, current limits are on the order
of terabytes, exabytes and zettabytes of data.[5]”
“Variety– Big data extends beyond structured data, including
unstructured data of all varieties
Velocity – Rate of acquisition and desired rate of consumption.
Volume –Terabytes and even petabytes of information.”
6. My Definition
Big Data is the “modern scale” at which we are defining or data usage
challenges. Big Data begins at the point where need to seriously start
thinking about the technologies used to drive our information needs.
While Big Data as a term seems to refer to volume this isn’t the case. Many
existing technologies have little problem physically handling large volumes
(TB or PB) of data. Instead the Big Data challenges result out of the
combination of volume and our usage demands from that data. And those
usage demands are nearly always tied to timeliness.
Big Data is therefore the push to utilize “modern” volumes of data within
“modern” timeframes. The exact definitions are of course are relative &
constantly changing, however right now this is somewhere along the path
towards the end goal. This is of course the ability to handle an unlimited
volume of data, processing all requests in real time.
Jan 2010
http://blog.tonybain.com/tony_bain/2010/01/what-is-big-data.html
7. For the purpose of this talk
Volume and Variety of Data
that is difficult to mange using
Big Data = traditional data management
technology
8. Where does big data come from?
Big Data is not an outcome of what we consider traditional user
“transactions”.
“VisaNet authorizes, clears and settles an average of 130 million
transactions per day in 200 countries and territories.”
(http://corporate.visa.com/about-visa/technology/transaction-processing.shtml)
Instead Big Data is mostly sourced from “Meta Data” about our
existence
• Sensors used to gather telemetry information
• Social media interactions & relationships
• Pictures and videos posted online
• Cell phone GPS signals
• Web page viewing habits
• Network, Computer generated, equipment logs
• Device outputs
• Online gaming
9. Where does big data come from?
Instead Big Data is generally meta data
“Since we only track the data that the games explicitly want to track, and
it’s all structured data …We just write the 5TB of structured information that
we need, and this is a scale that high-end SQL databases can definitely
handle.”
“eBay processes 50 petabytes of information each day while adding 40
terabytes of data each day as millions buy and sell items across 50,000
categories. Over 5,000 business users and analysts turn over a terabyte of
data every eight seconds. Raising parallel efficiency, the effectiveness of
distributing large amounts of workload over pools and grids of servers
equals millions in operating expense savings.”
10. What about the Enterprise?
The median data warehouse has remained consistent for a number of years (6.6GB)
• Is the pending Big Data explosion hype or a reality?
The enterprise is focused on ROI
• We do store large amounts of historical data as this serves a perceived need & the overhead is low
• We have not yet started capturing large amounts of meta data as demonstrating return on this
investment to date would have been difficult
• More data will be captured as we gain the ability to derive value from that data
• More data will be captured as we understand the value of auxiliary data
The skills gap
• Big data expertise are few and far between
• Technologies are new and raw
• The combination of skills is new
However this is expected to change in the next 2-5 years
12. Why Big Data?
Competitive Advantage is becoming highly time sensitive
• Consumers are fickle, change is easy, word of mouth is
rapid and powerful
• The rate at which we can understand and respond to
changes in customer sentiment, product needs,
satisfaction, engagement, risk profiles = competitive
strength
• Reinvention is constant
So how do we do this?
• With Data Science!
13. What is Data Science?
“I keep saying that the sexy job in the next 10 years will be
statisticians,”
Hal Varian, chief economist at Google
"data is the next Intel Inside.“
Tim O'Reilly, O'Reilly
“We’re rapidly entering a world where everything can be
monitored and measured, but the big problem is going to be the
ability of humans to use, analyze and make sense of the data.”
Erik Brynjolfsson, MIT
“Data science is the civil engineering of data. Its acolytes
possess a practical knowledge of tools & materials, coupled with
a theoretical understanding of what's possible.”
Michael Driscoll , Metamarkets
14. What is Data Science?
Historical Business Analytics
• Effectively reporting what has happened or what is happening
• Using pre-determined models for prediction
• Known data models
• Known questions
• Known outcomes
The Future of Data Science
• Discovering new information from historical data
• Often involves combining public & private data sets
• Using machine learning to create unique prediction models
• Diverse and complex data models
• Questions may not be known
• Outcomes may not be expected
• Visualizing results
15. Public Data Sets
There is a phenomenal amount of relevant publically available data sets
• Geographical, Climatic, Demographic, Knowledge bases
• Historical, political, medical, sports, social, music
Infochimps (15,000 data sets)
• NYSE Daily 1970-2010 Open, Close, High, Low and Volume
• Global Daily Weather Data from the National Climate Data Center (NCDC)
• Average Hours Worked Per Day by Employed Persons: 2005
• Family Net Worth -- Mean and Median Net Worth in Constant
Factual
Data.gov
16. Telecommunications Churn
Analyzed billions of calls from millions of customers
More than 2,000 call history variables such as time to end of contract, subscriber tenure,
number of customer care calls, etc.
Adding Social networking variables improved predicted churn much better.
The "social networking" variables were identified by observing who each customer spoke
with on the phone the most frequently.
If someone in your caller network churned, you were then ~700% more likely to churn as
well.
20. Big Data & the RDBMS
Volume/Velocity is too great for a single system to manage
Big Data often requires distribution over multiple hosts
The general purpose relational database is difficult to scale out
Has been a computer science challenge for many years
• Giving up nothing and scaling out has proven very difficult
21. Relaxing Consistency
Consistent, Eventually (Eventual Consistency)
“Given enough time without changes everything will eventually be consistent”
Many big data technologies relax aspects of CAP to achieve scale, performance &
availability
Not necessarily suitable for all requirements
Not necessarily the only way to achieve scale
Much debate CAP, BASE
Tunable consistency
23. Big Data Webinar
What is NoSQL?
NoSQL term was coined in 1998 however remained largely unused until 2009
Resurgence was the result of key papers from Amazon & Google and the resulting projects at the likes of
Rackspace and Facebook.
• Google Bigtable (http://labs.google.com/papers/bigtable.html)
• Amazon Dynamo (www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
Started as a design philosophy of simplifying data storage technologies to the point where massive scale
could be achieved
Now more broad, includes a range of data management technologies which are not based on the
relational model.
NoSQL data stores tends to fall into broad categories
• Distributed Key/Value stores
• Document stores
• Graph Databases
A little ironic as many “NoSQL” data stores have evolved support for SQL like queries.
26. NoSQL
Distributed Key/Value Store - Column Family
Database system designed to scale Cassandra
• Distributed scale over many nodes
Oracle Big Data Appliance
Usually focused on transactional • BerkeleyDB
performance
• Usually unsuitable for OLAP Voldemort
Limited querying capabilities Hypertable
• Primarily single record lookup by “key”
Redis
Usually flexible schema
• Row 1 may have different schema to Raik
Row 2
Usually
• No Joins / normalization
• No ordering
• No aggregation
27. NoSQL
Document Model
Records are considered documents MongoDB
Records have a flexible or varying schema CouchDb
depending on attributes of that document
Marklogic
Schemas are usually based on an encoding
such as JSON, BJSON or XML
Document orientated data stores generally
have more functionality than k/v stores so
sacrifice some performance
Document orientated data stores tend to fit
more naturally with OO paradigms than
RDBMS reducing plumbing code
significantly
Scale out is generally possible using a
sharding and replication
28. NoSQL
Graph
Graph databases are used to store entities DEX
and relationships between entities
FlockDB
Nodes, properties and edges
InfitieGraph
Examples of Graph database strengths
• Shortest distance between entities Neo4J
• Social network analysis
• If Bob knows Bill and Mary knows Bill
maybe Bob knows Mary
30. NewSQL
What is NewSQL
Term coined recently by Michael Stonebraker
Not throwing out the baby with the bathwater
• SQL and the Relational Model is well known by almost
every developer
• Arguments that the relational model itself is not a
limitation on scale, it is the physical implementation of
the model which has limitations
Retaining key aspects of the RDBMS but removing some of
the general purpose nature
By specializing challenges associated with scaling can be
more easy overcome
31. NewSQL
In Memory TP
In Memory is becoming a reality for Examples
transaction processing
VoltDB
Often designed only to process short SAP Hana
duration transactions
Systems are appearing with 1TB RAM per
node
Distributed K-Safety protects from single
node failure
Traditional command logging/snapshoting
can provide protection from complete
system failure
• System bug
• Comes with performance costs
32. NewSQL
MPP
Focus on distributing of data and load Examples
across multiple hosts for the purpose of
analytical query performance
Microsoft Parallel Data Warehouse
Often use brute force access methods (i.e.
HP Vertica
no user indexing)
EMC Greenplum
Often focus on running low concurrency
high impact queries Teradata / Aster Data
Often columnar storage or hybrid IBM Netezza
• Improved compression
• Improved query performance across
wide tables
34. Hadoop
What is Hadoop?
Apache Hadoop is an open source project inspired by Google’s Map/Reduce
and GFS papers
• http://labs.google.com/papers/mapreduce.html
• http://labs.google.com/papers/gfs.html
Hadoop consists of
• A distributed files system (HDFS)
• The Map/Reduce engine
Yahoo has been the largest contributor to the project, although now many
original contributors are working for Cloudera or Hortonworks
Hadoop is not a database
• Data lives in the file system in raw format
• Jobs process data direct from the files system using Map/Reduce
• Data can be directly accessed without ETL
35. The Scale of Hadoop
Hadoop Scales to very large clusters, capable of supporting very large
workloads
• Yahoo has ~20 Hadoop clusters ~200PB (largest 4000 nodes)
• Facebook have 2000 nodes in a single cluster with 20PB
Cloudera has 22 customers in prod with over 1PB
BUT we are not all Facebook or Yahoo
Most clusters are less than 30 nodes
• The average cluster size is 200 nodes being boosted by the increasing
number of large clusters
In the Enterprise Hadoop is being used increasingly for
• Data Processing
• Analytics
36. Hadoop
HDFS & Map/Reduce
HDFS Map/Reduce
Distributed file system Programming Paradigm
Responsible for ensuring each Takes seemingly complex tasks
block is replicated to multiple and breaks them into a series of
nodes • Mapping Tasks
• Protection from node failure • Reduction Tasks
• The large the cluster size the
more likely node failure is
Elastic Map/Reduce
37. How is Hadoop used in the Enterprise?
FSI Retail & Manufacturing
Customer Risk Analysis Customer Churn
Surveillance and Fraud Detection Brand and Sentiment Analysis
Central Data Repository Point of Sales
Personalization and Asset Management Pricing Models
Market Risk Modeling Customer Loyalty
Trade Performance Analytics Targeted Offers
Science & Energy
Genomics
Utilities and Power Grid
Smart Meters
Biodiversity Indexing
Network Failures
Seismic Data
39. Where does this leave the RDBMS?
Absolutely the RDBMS will remain the single most important data management technology for the
foreseeable future
• Its “General Purpose” nature still makes it the suitable technology of choice for 95% of data
management needs
• What we have discussed in this presentation is the 5% of requirements which have volume/velocity
issues not suitably managed by the RDBMS
These technologies are on a convergence path
• NoSQL, NewSQL, Hapdoop, MPP – were all created out a need that wasn’t being met
• But we don’t “want” many different technologies, we like one “General Purpose” tool
• Bringing the technologies together through acquisition, innovation and integration is the role of the
major vendors
This is already happening
40. The major Vendors are following
Oracle
• Exadata
• Big Data Appliance (http://www.oracle.com/us/corporate/features/feature-obda-498724.html)
• Oracle Tools & Loader for Hadoop (http://www.oracle.com/us/corporate/features/feature-oracle-
loader-for-hadoop-505115.html)
• Exalytics Appliance (http://gigaom.com/2011/10/02/oracle-exalytics-attacks-big-data-analytics/)
Microsoft
• Parallel Data Warehouse (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-
warehousing/pdw.aspx)
• SQL Server connector for Hadoop (http://www.microsoft.com/download/en/details.aspx?id=27584)
• Microsoft Hadoop (2012) (http://www.zdnet.com/blog/microsoft/microsoft-to-develop-hadoop-
distributions-for-windows-server-and-azure/10958)
• Microsoft Daytona (http://research.microsoft.com/en-us/projects/daytona/)
41. How does an enterprise prepare?
Understand success metrics & where competitive advantage is derived
• Customer retention, upsell, revenue growth
Is business driven not technology driven
IT can however ensure they are educated on new ways to fulfill business requirements
Even if you remain aligned with major venders diversity is imminent