SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL

17th March 2015

© 2015 EXASOL AG
BCS Data Management Specialist Group
Dave Shuttleworth – Principal Consultant, Exasol UK
email: dave.shuttleworth@exasol.com
Twitter: @EXA_DaveS

Agenda
• Introduction & background
• SQL vs NoSQL - observations
• Case study
  • King – online gaming
• What’s hot?
• Q & A

My background
• 2014 - 2015 – EXASOL UK – Principal Consultant
  • Introducing EXASOL DBMS technology into the UK
• 2003 - 2014 – Intelligent Edge Group – Principal Consultant
  • Data Warehouse design and migration from older technologies to new MPP DBMS
  • Business Intelligence infrastructure architect
  • New DBMS technology assessment
• 1992 - 2003 – WhiteCross Systems (now Kognitio) – Principal Consultant
  • Pre-sales and post-sales technical support
• 1989 - 1992 – Teradata – Consultant
• 1980 - 1989 – Data General (now part of EMC) – Systems engineer
• 1975 - 1980 – UK retailer – Analyst programmer
  • Applications design and implementation, system management and tuning

What is Exasol?
The World’s Fastest Analytic Database
• A column store, in-memory, massively parallel processing (MPP) database
• Modern software designed for analytics
• Runs on standard x86 hardware
• Uses standard SQL language (with optional extensions)
• Suitable for any scale of data & any number of users
• Mature, proven & very cost effective
• Quick to implement & easy to operate

We are the benchmark leader
On 1 Terabyte of data - an order of magnitude faster than its closest rival
[Chart: TPC-H queries per hour (QphH@1000 GB), axis 1,000,000 to 4,000,000, results dated Aug ´11 to Sept ´14]
EXASOL 5,246,338
Microsoft 519,976
Vectorwise 445,529
Oracle 326,454
Sybase IQ 258,474
Microsoft 219,887
Oracle 209,533
Oracle 201,487
Microsoft 134,117
Source: www.tpc.org / Sept 22, 2015
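The "order of magnitude" claim can be checked from the figures on the slide. A minimal sketch, assuming the unlabelled top result (5,246,338) is the EXASOL entry and that the closest rival is the largest of the labelled results:

```python
# QphH@1000GB figures as they appear on the slide.
# Assumption: the unlabelled 5,246,338 result belongs to EXASOL.
exasol_qphh = 5_246_338

rivals = [
    ("Microsoft", 519_976),
    ("Vectorwise", 445_529),
    ("Oracle", 326_454),
    ("Sybase IQ", 258_474),
    ("Microsoft", 219_887),
    ("Oracle", 209_533),
    ("Oracle", 201_487),
    ("Microsoft", 134_117),
]

# The closest rival is the one with the highest queries-per-hour figure.
closest_name, closest_qphh = max(rivals, key=lambda r: r[1])
speedup = exasol_qphh / closest_qphh

print(f"Closest rival: {closest_name} at {closest_qphh:,} QphH")
print(f"Ratio: {speedup:.1f}x")
```

The ratio comes out at roughly ten, which is what "an order of magnitude" means here.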

SQL vs NoSQL - background
• Databases and Data Warehouses have evolved to meet the needs of business (over many years…!)
  • Generally using some form of Relational Database (SQL based)
  • Originally tightly structured data, now expanding to include unstructured data
  • Ever increasing data volumes and complexity
• New technologies have emerged to address (and extend) the storage and management requirements
  • Fast cheap network connectivity
  • Cloud services for cheaper and more flexible implementation
  • Wider acceptance of open source software for production systems
  • Hadoop parallel processing platform – often in a ‘hybrid’ environment
  • Alternative database technologies (e.g. document stores, graph databases)
  • Publicly accessible data sources (e.g. weather history, flight data, Google searches, Twitter feeds, census data, mapping data)
• More complex analytics needed to stay competitive

• Proliferation of NoSQL (‘not only SQL’) databases – over 150 listed on nosql-database.org – classified by type:
  • Wide Column Stores
    • E.g. Hadoop, MapR, Cassandra, MonetDB
  • Document stores
    • Elasticsearch, MongoDB, Couchbase, MarkLogic
  • Key value/tuple store
    • DynamoDB, Azure Table Storage, Oracle NoSQL, MemcacheDB
  • Graph databases
    • Neo4j, YarcData, GraphBase
  • Multimodal databases
  • Object databases
  • etc.

• The inherent restrictions of relational databases are addressed by NoSQL implementations:
  • More flexible data model – ‘schemaless’ or ‘schema on read’
  • ‘Schemaless’ can mean very fast write performance – useful for streaming data
  • Simplifies handling of unstructured and semi-structured data such as logfiles, other machine generated data and text
  • Designed for easy scale-up (and scale down) to handle seasonal workloads
  • High levels of concurrency can be achieved via distributed processing
  • High availability via replication is built into some NoSQL databases
  • Maps well to cloud based infrastructure and capabilities (if done well!)
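To make ‘schema on read’ concrete, here is a minimal sketch in Python. The event records and the helper `read_with_schema` are hypothetical and not tied to any particular NoSQL product. The point is only that records are stored as they arrive and a schema is applied when they are read.

```python
import json

# Raw events are stored exactly as they arrive; nothing is validated on write.
# (Hypothetical sample data: note that the records do not share the same fields.)
raw_log = [
    '{"user": "u1", "event": "login", "ts": 1426579200}',
    '{"user": "u2", "event": "purchase", "amount": 4.99, "ts": 1426579260}',
    '{"user": "u1", "event": "level_up", "level": 12}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, fill gaps with None."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}

# Two different "schemas" can be projected over the same stored data.
purchases_view = list(read_with_schema(raw_log, ["user", "event", "amount"]))
for row in purchases_view:
    print(row)
```

The write path stays trivial (append a line), which is why schemaless stores can ingest streaming data quickly. The cost moves to every reader, who must cope with missing or inconsistent fields.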

Hadoop today is …
• Still Open Source!
• Began with HDFS and Map/Reduce
• Now comprises a number of additional technologies
  • File systems (e.g. Tachyon)
  • Cluster Managers (e.g. YARN + Mesos)
  • Execution Engines (e.g. Tez, Spark etc.)
  • Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)

Hadoop With Everything?
• Hadoop was invented to more easily distribute the Nutch web search engine across a cluster of machines.
  • Map/Reduce – distributed processing
  • HDFS – distributed file system
• Began to be used for … just about everything.
  • But not all processing tasks are like indexing the Internet
• Hadoop started to attract criticism
  • But usually when it was being used for something it wasn’t designed for

Definitely NOT jobs for Hadoop
• Word processing
• Payroll system
• Anything on a single computer
• Anything with “small” data

Analytical Queries
• “GROUP BY” logic
  • i.e. not concerned with individual data items
• Analytical Functions
  • MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
• Table joins, nested subqueries
Usually short-running, ad-hoc and submitted many at a time.

Map/Reduce and HDFS: the wrong tools for Analytics?
• Queries tend to be short: fault tolerance is less important
  • If the chance of failure in a 5 hour batch is 1 in 300
  • then the chance of failure in a 5 second query is about 1 in 1,000,000
• Queries tend to be short: start-up time is significant
  • A 20 second start-up time is NOT OK on a 5 second query
• A number of projects started to address these issues
  • e.g. “Hot containers” in Hive on Tez to reduce start-up time
  • Also pushdown via Hive partitions or ORC predicate pushdown
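The failure and start-up figures above can be reproduced with a few lines of arithmetic. This sketch assumes failures arrive at a constant rate over time (an exponential model), which the slide does not state explicitly.

```python
import math

batch_seconds = 5 * 60 * 60   # 5 hour batch job
query_seconds = 5             # 5 second query
p_batch_failure = 1 / 300     # figure quoted on the slide

# Assumption: constant failure rate, so P(failure within t) = 1 - exp(-rate * t).
rate = -math.log(1 - p_batch_failure) / batch_seconds
p_query_failure = -math.expm1(-rate * query_seconds)
one_in = round(1 / p_query_failure)
print(f"Chance of failure in a 5 second query: about 1 in {one_in:,}")

# Start-up overhead: 20 seconds of start-up in front of a 5 second query.
startup_seconds = 20
overhead_share = startup_seconds / (startup_seconds + query_seconds)
print(f"Start-up share of total elapsed time: {overhead_share:.0%}")
```

Under that assumption the per-query failure chance is a little over one in a million, in line with the slide's rounded figure. The 20 second start-up accounts for four fifths of the elapsed time.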

Map/Reduce: the wrong language for Analytics?
Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation
Stage 0: Map-Shuffle-Reduce
    Mapper(row) {
      fields = row.split("\t")
      emit(fields[0], fields[1]);
    }
    Reducer(key, values) {
      sum = 0;
      for (value in values) {
        sum += value;
      }
      emit(key, sum);
    }
Stage 1: Map-Shuffle
    Mapper(row) {
      ...
      emit(page_views, page_name);
    }
    ... shuffle
Stage 2: Local
    data = open("stage1.out")
    for (i in 0 to 10) {
      print(data.getNext())
    }
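The three stages can be simulated in a single Python process to show what each one contributes. This is an illustrative sketch only: the input rows are hypothetical, and an in-memory dictionary stands in for the distributed shuffle.

```python
from collections import defaultdict

# Hypothetical tab-separated input rows: page_name <TAB> page_views
rows = [
    "Main_Page\t120",
    "SQL\t45",
    "Hadoop\t30",
    "Main_Page\t80",
    "SQL\t15",
]

# Stage 0: map each row to (page_name, page_views), shuffle by key, reduce by summing.
def mapper(row):
    fields = row.split("\t")
    return fields[0], int(fields[1])

shuffled = defaultdict(list)
for key, value in map(mapper, rows):
    shuffled[key].append(value)

totals = {key: sum(values) for key, values in shuffled.items()}

# Stage 1: re-key by page_views so that the shuffle orders records by view count.
by_views = sorted(((views, name) for name, views in totals.items()), reverse=True)

# Stage 2: read the first 10 records locally.
top10 = by_views[:10]
for views, name in top10:
    print(name, views)
```

Even in this compressed form, the grouping, the ordering, and the limit each need their own step.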

Equivalent in SQL
    SELECT
      page_name,
      SUM(page_views) views
    FROM wikistats
    GROUP BY page_name
    ORDER BY views DESC
    LIMIT 10;
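To try the query without a cluster, the sketch below runs it against an in-memory SQLite database from Python. SQLite is only a convenient stand-in here, and the sample rows in `wikistats` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
conn.executemany(
    "INSERT INTO wikistats VALUES (?, ?)",
    [("Main_Page", 120), ("SQL", 45), ("Hadoop", 30), ("Main_Page", 80), ("SQL", 15)],
)

# The query from the slide: total views per page, ten most viewed pages first.
top_pages = conn.execute(
    """
    SELECT page_name, SUM(page_views) AS views
    FROM wikistats
    GROUP BY page_name
    ORDER BY views DESC
    LIMIT 10
    """
).fetchall()
conn.close()

print(top_pages)
```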

The SQL language
• Portable
• Well-defined standards exist
• No detailed knowledge of the platform required
  • e.g. you don’t need to manage memory
• SQL is assumed by a lot of reporting tools
• Widely used and understood even by non-technical people

I’m not saying that SQL is perfect
• Try writing the simple Hadoop “Word Count” example in pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured
  • “Which part of STRUCTURED Query Language don’t you understand …?!”
• All I’m saying is that it is an excellent language for analytical queries.
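For readers unfamiliar with the term, “sessionising” is commonly taken to mean grouping each user's hits into visits separated by a period of inactivity. The sketch below shows why a procedural language makes this easy. The weblog records and the 30 minute timeout are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout

# Hypothetical weblog records: (user_id, timestamp of the request)
hits = [
    ("u1", datetime(2015, 3, 17, 9, 0)),
    ("u1", datetime(2015, 3, 17, 9, 10)),
    ("u2", datetime(2015, 3, 17, 9, 5)),
    ("u1", datetime(2015, 3, 17, 10, 30)),
    ("u2", datetime(2015, 3, 17, 9, 20)),
]

def sessionise(records, gap=SESSION_GAP):
    """Label each hit with a per-user session number; a long gap starts a new session."""
    last_seen = {}
    session_no = {}
    labelled = []
    for user, ts in sorted(records, key=lambda r: (r[0], r[1])):
        if user not in last_seen or ts - last_seen[user] > gap:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        labelled.append((user, ts, session_no[user]))
    return labelled

sessions = sessionise(hits)
for user, ts, number in sessions:
    print(user, ts.isoformat(), number)
```

The logic depends on the previous row for the same user, which is awkward to express in plain SQL without window functions.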

Hadoop could handle SQL (via Hive), but historically …
• High Latency
• Restricted SQL options
• All but simple table joins were difficult
• Little support for compression & indexing
• Merv Adrian (Gartner Research - 2014)
  • “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
• Result: EVERYTHING looked good compared to Hive

Everyone still likes to compare themselves to Hive

EXASOL being no exception!

Hive continues to be improved …
• Completed
  • Views (HIVE-1143)
  • Partitioned Views (HIVE-1941)
  • Storage Handlers (HIVE-705)
  • HBase Integration
  • HBase Bulk Load
  • Locking (HIVE-1293)
  • Indexes (HIVE-417)
  • Bitmap Indexes (HIVE-1803)
  • Filter Pushdown (HIVE-279)
  • Table-level Statistics (HIVE-1361)
  • Dynamic Partitions
  • Binary Data Type (HIVE-2380)
  • Decimal Precision and Scale Support
  • HCatalog
  • HiveServer2 (HIVE-2935)
  • Column Statistics in Hive (HIVE-1362)
  • List Bucketing (HIVE-3026)
  • Group By With Rollup (HIVE-2397)
  • Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
  • Optimizing Skewed Joins (HIVE-3086)
  • Correlation Optimizer (HIVE-2206)
  • Hive on Tez (HIVE-4660)
  • Vectorized Query Execution (HIVE-4160)
• In Progress
  • Atomic Insert/Update/Delete (HIVE-5317)
  • Transaction Manager (HIVE-5843)
  • Cost Based Optimizer in Hive (HIVE-5775)
• Proposed
  • Spatial Queries
  • Theta Join (HIVE-556)
  • JDBC Storage Handler
  • MapJoin Optimization
  • Proposal to standardize and expand Authorization in Hive
  • Dependent Tables (HIVE-3466)
  • AccessServer
  • Type Qualifiers in Hive
  • MapJoin & Partition Pruning (HIVE-5119)
  • SQL Standard based secure authorization (HIVE-5837)
  • Updatable Views (HIVE-1143)
  • Hive on Spark (HIVE-7292)

The dream data architecture for analytics …
Based on the SQL language
but leverages Hadoop’s extreme scalability
and Hadoop’s fault tolerance
while not compromising on speed.
Could it please also have some maturity ?
And be easy to use ?

The current reality
• SQL on SQL, which is arguably
  • Less scalable
  • Less fault tolerant
  • Less good with unstructured data
• SQL on Hadoop, which is arguably
  • Less mature
  • Less easy to use
  • Slower

Choices for SQL and Hadoop
• SQL AND HADOOP
  • A Connector
• HADOOP ON SQL
  • User Defined Functions
• SQL ON HADOOP
  • Something like Hive, but better

Option 1 – SQL AND HADOOP
Run SQL on SQL and Hadoop on Hadoop and use a connector to join the two systems
Pros
• Minimal impact (SQL and Hadoop worlds can function as before)
• Easier to implement
Cons
• Network!
• Challenge of optimising across two technologies

Option 2 – HADOOP ON SQL
• Bring Map/Reduce into the Parallel database
• For example using Java User Defined Functions
    select my_java_map_function(words) a_word,
           count(*) word_count
    from DOCUMENTS
    group by 1
• Doesn’t benefit from Hadoop’s storage advantages
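The query above has the shape of the classic word count: a map function emits one word per row, and the database groups and counts them. The Python sketch below mimics that shape outside a database. `my_map_function` is a hypothetical stand-in for the slide's `my_java_map_function`, and the lower-casing and whitespace splitting are assumptions, not something the slide specifies.

```python
from collections import Counter

# Hypothetical contents of the DOCUMENTS table: one text value per row.
documents = [
    "SQL is not dead",
    "Hadoop is not SQL",
]

def my_map_function(words):
    """Stand-in for the map UDF: emit one lower-cased word per whitespace-separated token."""
    for word in words.split():
        yield word.lower()

# Equivalent of GROUP BY a_word with COUNT(*).
word_count = Counter(word for doc in documents for word in my_map_function(doc))

for a_word, count in word_count.most_common():
    print(a_word, count)
```

In the database version the grouping and counting are done by the SQL engine in parallel. Only the tokenising step lives in user code.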

Option 3 - SQL ON HADOOP
Build a relational database on Hadoop storage
• Impala (Cloudera)
• Stinger (Hortonworks)
• Presto (Facebook)
• SparkSQL (UC Berkeley)
• HAWQ (Pivotal)
• BigSQL (IBM)
• Apache Phoenix (for HBase)
• Apache Tajo
• Apache Drill
• etc.
AND DON’T FORGET HIVE!

Four possible market outcomes…
• Hadoop and SQL databases are on a collision course – only one will survive
  • No sign of that so far
• They are complementary – both will survive
  • Probably - the challenge is how to make them work together
• They will merge and become one
  • Some indications this is already starting to happen
• Something even more amazing will come along and replace them both
  • Sometimes this happens – Spark?

What do the pundits say?
• Martin Fowler – Thoughtworks
  • The rise of NoSQL databases marks the end of the era of relational database dominance
  • But NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice.
  • The era of Polyglot Persistence has begun - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data
• Emil Eifrem – Neo Technology
  • When evaluating a NoSQL database, it is critical to demand enterprise-readiness. An enterprise delivering modern applications needs a NoSQL database that can manage today's complex and connected data while still delivering the enterprise strength, transactions and durability that IT departments have relied on for years.

Case Study - King
King in numbers
• 100 million daily active users
• 1 billion game plays per day
• 8 offices
And lots and lots of data...
• 14 billion rows per day
• 500 GB of new data per day
• 700 TB stored

King - Getting to know 500 million players
Objectives in game analytics
• Metrics and KPIs
• Measure and understand player behaviour
• Player segmentation
• Improve player experience
• Forecasting
• Predictive modelling

Challenges at King
• Extreme scale
• Rate of growth
• Speed of innovation
• Cross platform
• Virtual economies

The King formula
• Data driven culture
• Engaged business
• Talented embedded data scientists
• A/B testing
• Right technology platform
• Right data model

How King does data
System architecture
[Diagram labels: Game servers, Log server, TSV log files, ETL, Data Warehouse, Raw data, Dimensional model, Reports, Data scientists]

How King does data
Our data keeps growing...
[Chart annotation: King launches on mobile...]

How King does data
…our technology has to keep up
[Chart labels: Qlikview says no; Infobright CE says no; 10 node Hadoop; 80 nodes; 40 nodes; 20 nodes; InfiniDB; Exasol]

How King does data
Why ExaSolution?
• Speed
• Efficiency
• Tuning free
• Scaling (150 TB and counting...)
• ExaDudes

Where next?
Future challenges
• Keep on scaling
• Closer Hadoop integration
• Evolving data model
• Microbatch ETL
• Real(er) time…

Internet of Things
• A definition:
  • The Internet of Things (IoT) is a scenario in which objects, animals or people are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction
• Basic concept has been around for decades – now accepted into the mainstream
• Wide range of potential uses:
  • Environmental monitoring
  • Infrastructure management
  • Manufacturing
  • Energy management
  • Medical and healthcare systems
  • Building and home automation
  • Transport systems

Other emerging technologies which produce data
• Wearable technologies – e.g. smart watches, Google Glass
• Bio sensors for humans (and other animals)
  • Health monitoring
  • Already in use on some dairy farms – optimise milk yields and give early warning for possible disease
• Location based data
  • All modern phones provide location data (either GPS or cell based)
  • ‘Crowd sourcing’ – e.g. traffic flow based on cellphone signals
  • Beacons – e.g. Regent Street in London
  • Location-based special offers and advertisement
• Facial recognition
  • To drive targeted advertisements

Other emerging trends
• Cloud being used for evaluation of new technologies and also as a platform for dev/test (and even DR) environments
• In-database analytics using UDFs in languages such as R, Lua and Python
  • Move the processing closer to the data
  • Run analytics on full data volumes (no sampling/extract required)
  • Get improved performance due to parallelism (where possible)
  • Lots of freely available R code on the web
• Automated conversion of analytical results to text (NLG) is emerging
  • AI rule-based generation of natural language output
  • Readable summaries and recommendations
  • Yseop, Narrative Science, Automated Insights, Arria NLG
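As a small-scale illustration of in-database analytics with a UDF, the sketch below registers a Python aggregate inside SQLite, so the calculation runs where the data lives and no extract is taken. SQLite is only a stand-in for an analytic database. The `plays` table, its rows, and the `py_median` function name are hypothetical.

```python
import sqlite3
import statistics

class Median:
    """User-defined aggregate: the engine feeds it values row by row, then asks for the result."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group after the last row.
        return statistics.median(self.values) if self.values else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE plays (player TEXT, score REAL)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("a", 20.0), ("b", 5.0), ("b", 15.0)],
)

# The UDF is called from SQL like any built-in aggregate.
medians = conn.execute(
    "SELECT player, py_median(score) FROM plays GROUP BY player ORDER BY player"
).fetchall()
conn.close()

print(medians)
```

On a parallel database the same pattern lets each node run the function over its own share of the data, which is where the performance benefit on the slide comes from.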

Summary
• Data and database technology isn’t going away!
• New database approaches are being developed to address the requirements of flexibility, scalability etc.
• These technologies drive an increasing need for more analysts, database designers, data scientists
• Hybrid systems are becoming the norm, with companies mixing ‘best of breed’ technologies (possibly open source) to get the best and most cost-effective results – use ‘the right tool for the job’
• SQL databases will continue to be widely utilised – but alongside other technologies and integration will become tighter
