SQL vs NoSQL: Why you’ll never dump your relations
17th March 2015
© 2015 EXASOL AG
BCS Data Management Specialist Group
Dave Shuttleworth – Principal Consultant, Exasol UK
email: dave.shuttleworth@exasol.com
Twitter: @EXA_DaveS

Agenda
• Introduction & background
• SQL vs NoSQL - observations
• Case study
  • King – online gaming
• What’s hot?
• Q & A

My background
• 2014 - 2015 – EXASOL UK – Principal Consultant
  • Introducing EXASOL DBMS technology into the UK
• 2003 - 2014 – Intelligent Edge Group – Principal Consultant
  • Data Warehouse design and migration from older technologies to new MPP DBMS
  • Business Intelligence infrastructure architect
  • New DBMS technology assessment
• 1992 - 2003 – WhiteCross Systems (now Kognitio) – Principal Consultant
  • Pre-sales and post-sales technical support
• 1989 - 1992 – Teradata – Consultant
  • Pre-sales and post-sales technical support
• 1980 - 1989 – Data General (now part of EMC) – Systems engineer
  • Pre-sales and post-sales technical support
• 1975 - 1980 – UK retailer – Analyst programmer
  • Applications design and implementation, system management and tuning

What is Exasol?
The World’s Fastest Analytic Database
• a column store, in-memory, massively parallel processing (MPP) database
• modern software designed for analytics
• runs on standard x86 hardware
• uses standard SQL language (with optional extensions)
• suitable for any scale of data & any number of users
• mature, proven & very cost effective
• quick to implement & easy to operate

We are the benchmark leader
On 1 terabyte of data - an order of magnitude faster than its closest rival
[Chart: queries per hour, QphH@1000 GB; axis from 1,000,000 to 4,000,000. Source: www.tpc.org / Sept 22, 2015]
Result dates shown: Sept ´14, April ´14, June ´12, Feb ´14, Dec ´13, Aug ´11, Sept ´11, Oct ´11, Dec ´11
Results shown: 5,246,338; Microsoft 134,117; Oracle 201,487; Oracle 209,533; Microsoft 219,887; Sybase IQ 258,474; Oracle 326,454; Vectorwise 445,529; Microsoft 519,976

SQL vs NoSQL - background
• Databases and Data Warehouses have evolved to meet the needs of business (over many years…!)
• Generally using some form of Relational Database (SQL based)
• Originally tightly structured data, now expanding to include unstructured data
• Ever increasing data volumes and complexity
• New technologies have emerged to address (and extend) the storage and management requirements
• Fast cheap network connectivity
• Cloud services for cheaper and more flexible implementation
• Wider acceptance of open source software for production systems
• Hadoop parallel processing platform – often in a ‘hybrid’ environment
• Alternative database technologies (e.g. document stores, graph databases)
• Publicly accessible data sources (e.g. weather history, flight data, Google searches, Twitter feeds, census data, mapping data)
• More complex analytics needed to stay competitive

SQL vs NoSQL - background
• Proliferation of NoSQL (‘not only SQL’) databases – over 150 listed on nosql-database.org – classified by type:
  • Wide column stores
    • e.g. Hadoop, MapR, Cassandra, MonetDB
  • Document stores
    • Elasticsearch, MongoDB, Couchbase, MarkLogic
  • Key value/tuple stores
    • DynamoDB, Azure Table Storage, Oracle NoSQL, MemcacheDB
  • Graph databases
    • Neo4j, YarcData, Graphbase
  • Multimodal databases
  • Object databases
  • etc.

SQL vs NoSQL - background
• The inherent restrictions of relational databases are addressed by NoSQL implementations:
  • More flexible data model – ‘schemaless’ or ‘schema on read’
  • ‘Schemaless’ can mean very fast write performance – useful for streaming data
  • Simplifies handling of unstructured and semi-structured data such as logfiles, other machine generated data and text
  • Designed for easy scale-up (and scale-down) to handle seasonal workloads
  • High levels of concurrency can be achieved via distributed processing
  • High availability via replication is built into some NoSQL databases
  • Maps well to cloud based infrastructure and capabilities (if done well!)

Hadoop today is …
• Still open source!
• Began with HDFS and Map/Reduce
• Now comprises a number of additional technologies
  • File systems (e.g. Tachyon)
  • Cluster managers (e.g. YARN + Mesos)
  • Execution engines (e.g. Tez, Spark etc.)
  • Analytical layer and applications (e.g. Hive, Pig, various SQL on Hadoop)

Hadoop With Everything?
• Hadoop was invented to more easily distribute the Nutch web search engine across a cluster of machines.
  • Map/Reduce – distributed processing
  • HDFS – distributed file system
• Began to be used for … just about everything.
  • But not all processing tasks are like indexing the Internet
• Hadoop started to attract criticism
  • But usually when it was being used for something it wasn’t designed for

Definitely NOT jobs for Hadoop
• Word processing
• Payroll system
• Anything on a single computer
• Anything with “small” data

Analytical Queries
• “GROUP BY” logic
  • i.e. not concerned with individual data items
• Analytical functions
  • MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
• Table joins, nested subqueries
Usually short-running, ad hoc and submitted many at a time.

Map/Reduce and HDFS: the wrong tools for analytics?
• Queries tend to be short: fault tolerance is less important
  • If the chance of failure in a 5 hour batch is 1 in 300
  • then the chance of failure in a 5 second query is about 1 in 1,000,000
• Queries tend to be short: start-up time is significant
  • a 20 second start-up time is NOT OK on a 5 second query
• A number of projects started to address these issues
  • e.g. “Hot containers” in Hive on Tez to reduce start-up time
  • Also pushdown via Hive partitions or ORC predicate pushdown
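
The failure odds and start-up overhead quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming failures arrive at a constant rate so that the chance of failure scales with run time (an assumption made here, not something the slide states); the durations and the 1 in 300 figure are the slide's own numbers.

```python
# Assumption: failures arrive at a constant rate, so for small probabilities
# the chance of a failure scales linearly with how long the job runs.
BATCH_SECONDS = 5 * 60 * 60      # 5 hour batch
BATCH_FAILURE_ODDS = 300         # 1 in 300, from the slide
QUERY_SECONDS = 5                # 5 second query

# The query is 3,600 times shorter than the batch, so its odds of failure
# are roughly 3,600 times longer.
query_failure_odds = BATCH_FAILURE_ODDS * BATCH_SECONDS // QUERY_SECONDS

# Share of elapsed time spent on start-up for a short query.
STARTUP_SECONDS = 20
startup_share = STARTUP_SECONDS / (STARTUP_SECONDS + QUERY_SECONDS)

print(f"Query failure odds: about 1 in {query_failure_odds:,}")
print(f"Start-up share of elapsed time: {startup_share:.0%}")
```

Under that assumption the query's odds come out at 1 in 1,080,000, which the slide rounds to 1 in 1,000,000, and start-up accounts for 80% of the elapsed time.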

Map/Reduce: the wrong language for analytics?
Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation

Stage 0: Map-Shuffle-Reduce
    Mapper(row) {
        fields = row.split("\t")
        emit(fields[0], fields[1]);
    }
    Reducer(key, values) {
        sum = 0;
        for (value in values) {
            sum += value;
        }
        emit(key, sum);
    }
Stage 1: Map-Shuffle
    Mapper(row) {
        ...
        emit(page_views, page_name);
    }
    ... shuffle
Stage 2: Local
    data = open("stage1.out")
    for (i in 0 to 10) {
        print(data.getNext())
    }
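
To make the three stages concrete, here is a minimal single-process sketch in Python that mimics map, shuffle and reduce on a handful of rows. The sample rows and the function names (`stage0_map`, `shuffle`, `stage0_reduce`, `top_pages`) are hypothetical; a real Map/Reduce job would distribute each stage across a cluster.

```python
from collections import defaultdict

# Hypothetical sample of tab-separated log rows: page_name<TAB>page_views.
ROWS = ["Main_Page\t120", "SQL\t30", "Hadoop\t45", "SQL\t25", "Main_Page\t80"]


def stage0_map(row):
    """Split a row and emit (page_name, page_views)."""
    fields = row.split("\t")
    return fields[0], int(fields[1])


def shuffle(pairs):
    """Group values by key, as the framework's shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def stage0_reduce(key, values):
    """Sum the page views for one page."""
    return key, sum(values)


def top_pages(rows, limit=10):
    """Run the three stages in sequence and return the top pages by views."""
    grouped = shuffle(map(stage0_map, rows))
    totals = [stage0_reduce(key, values) for key, values in grouped.items()]
    # Stage 1: re-key by view count so the data can be ordered by views.
    by_views = sorted(((views, name) for name, views in totals), reverse=True)
    # Stage 2: read the first `limit` rows of the sorted output.
    return [(name, views) for views, name in by_views[:limit]]


print(top_pages(ROWS))
```

Even in this toy form, the programmer has to spell out the grouping, the re-keying and the final read, which is the point the slide is making.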

Equivalent in SQL

    SELECT
        page_name,
        SUM(page_views) AS views
    FROM wikistats
    GROUP BY page_name
    ORDER BY views DESC
    LIMIT 10;
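
The query can be tried end to end with Python's built-in `sqlite3` module. This is only a sketch: SQLite stands in for whatever SQL engine is actually in use, and the sample rows are hypothetical; the table and column names follow the slide.

```python
import sqlite3

# Hypothetical sample data; the table and column names follow the slide.
SAMPLE = [("Main_Page", 120), ("SQL", 30), ("Hadoop", 45),
          ("SQL", 25), ("Main_Page", 80)]

QUERY = """
SELECT page_name, SUM(page_views) AS views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10
"""


def run_query(rows):
    """Load the rows into an in-memory table and run the top-10 query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
        conn.executemany("INSERT INTO wikistats VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


print(run_query(SAMPLE))
```

The grouping, ordering and row limit are all declared in the query itself; how they are executed is left to the engine.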

The SQL language
• Portable
• Well-defined standards exist
• No detailed knowledge of the platform required
  • e.g. you don’t need to manage memory
• SQL is assumed by a lot of reporting tools
• Widely used and understood even by non-technical people

I’m not saying that SQL is perfect
• Try writing the simple Hadoop “Word Count” example in pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured
  • “Which part of STRUCTURED Query Language don’t you understand …?!”
• All I’m saying is that it is an excellent language for analytical queries.
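
Sessionising shows the kind of row-by-row logic that is short in procedural code. The sketch below is illustrative only: it assumes a session ends after 30 minutes of inactivity (a common convention, not something stated on the slide), and the event data and the `sessionise` function are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionise(events, gap=SESSION_GAP_SECONDS):
    """Assign a per-user session number to (user, epoch_seconds) events.

    Returns a list of (user, epoch_seconds, session_number) tuples.
    """
    result = []
    ordered = sorted(events)  # sort by user, then by time
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        session = 0
        previous = None
        for _, timestamp in user_events:
            # A gap longer than the timeout starts a new session.
            if previous is None or timestamp - previous > gap:
                session += 1
            previous = timestamp
            result.append((user, timestamp, session))
    return result


# Hypothetical weblog events: (user, epoch seconds).
EVENTS = [("alice", 0), ("alice", 600), ("alice", 4000),
          ("bob", 100), ("bob", 200)]
print(sessionise(EVENTS))
```

Each event's session depends on the event before it for the same user, which is why the logic reads naturally as a loop.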

Hadoop could handle SQL (via Hive), but historically …
• High latency
• Restricted SQL options
• All but simple table joins were difficult
• Little support for compression & indexing
• Merv Adrian (Gartner Research - 2014)
  • “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
• Result: EVERYTHING looked good compared to Hive

Everyone still likes to compare themselves to Hive

EXASOL being no exception!

Hive continues to be improved …
• Completed
  • Views (HIVE-1143)
  • Partitioned Views (HIVE-1941)
  • Storage Handlers (HIVE-705)
  • HBase Integration
  • HBase Bulk Load
  • Locking (HIVE-1293)
  • Indexes (HIVE-417)
  • Bitmap Indexes (HIVE-1803)
  • Filter Pushdown (HIVE-279)
  • Table-level Statistics (HIVE-1361)
  • Dynamic Partitions
  • Binary Data Type (HIVE-2380)
  • Decimal Precision and Scale Support
  • HCatalog
  • HiveServer2 (HIVE-2935)
  • Column Statistics in Hive (HIVE-1362)
  • List Bucketing (HIVE-3026)
  • Group By With Rollup (HIVE-2397)
  • Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
  • Optimizing Skewed Joins (HIVE-3086)
  • Correlation Optimizer (HIVE-2206)
  • Hive on Tez (HIVE-4660)
  • Vectorized Query Execution (HIVE-4160)
• In Progress
  • Atomic Insert/Update/Delete (HIVE-5317)
  • Transaction Manager (HIVE-5843)
  • Cost Based Optimizer in Hive (HIVE-5775)
• Proposed
  • Spatial Queries
  • Theta Join (HIVE-556)
  • JDBC Storage Handler
  • MapJoin Optimization
  • Proposal to standardize and expand Authorization in Hive
  • Dependent Tables (HIVE-3466)
  • AccessServer
  • Type Qualifiers in Hive
  • MapJoin & Partition Pruning (HIVE-5119)
  • SQL Standard based secure authorization (HIVE-5837)
  • Updatable Views (HIVE-1143)
  • Hive on Spark (HIVE-7292)

The dream data architecture for analytics …
Based on the SQL language
but leverages Hadoop’s extreme scalability
and Hadoop’s fault tolerance
while not compromising on speed.
Could it please also have some maturity?
And be easy to use?

The current reality
• SQL on SQL, which is arguably
  • Less scalable
  • Less fault tolerant
  • Less good with unstructured data
• SQL on Hadoop, which is arguably
  • Less mature
  • Less easy to use
  • Slower

Choices for SQL and Hadoop
• SQL AND HADOOP
  • A connector
• HADOOP ON SQL
  • User Defined Functions
• SQL ON HADOOP
  • Something like Hive, but better

Option 1 – SQL AND HADOOP
Run SQL on SQL and Hadoop on Hadoop and use a connector to join the two systems
Pros
• Minimal impact (SQL and Hadoop worlds can function as before)
• Easier to implement
Cons
• Network!
• Challenge of optimising across two technologies

Option 2 – HADOOP ON SQL
• Bring Map/Reduce into the parallel database
• For example using Java User Defined Functions

    select my_java_map_function(words) a_word,
           count(*) word_count
    from DOCUMENTS
    group by 1

• Doesn’t benefit from Hadoop’s storage advantages
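
The idea of calling user-written code from inside a SQL query can be tried on a small scale with Python's `sqlite3` module. This is a loose sketch, not the slide's setup: SQLite stands in for a parallel database, a Python function stands in for the Java UDF, and `normalise_word` is a hypothetical scalar function (one output per input row), so the input is assumed to be already split into one word per row.

```python
import sqlite3


def normalise_word(word):
    """Hypothetical scalar UDF: lower-case a word and strip punctuation."""
    return word.lower().strip(".,;:!?\"'")


def word_counts(words):
    """Count words in SQL, calling the Python UDF from inside the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Register the Python function so SQL can call it by name.
        conn.create_function("normalise_word", 1, normalise_word)
        conn.execute("CREATE TABLE documents (word TEXT)")
        conn.executemany("INSERT INTO documents VALUES (?)",
                         [(w,) for w in words])
        return conn.execute(
            "SELECT normalise_word(word) AS a_word, COUNT(*) AS word_count "
            "FROM documents GROUP BY 1 ORDER BY word_count DESC, a_word"
        ).fetchall()
    finally:
        conn.close()


print(word_counts(["SQL", "sql,", "Hadoop", "Sql!", "hadoop"]))
```

The custom logic lives in the function while the grouping and counting stay declarative, which is the division of labour this option relies on.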

Option 3 - SQL ON HADOOP
Build a relational database on Hadoop storage
• Impala (Cloudera)
• Stinger (Hortonworks)
• Presto (Facebook)
• SparkSQL (UC Berkeley)
• HAWQ (Pivotal)
• BigSQL (IBM)
• Apache Phoenix (for HBase)
• Apache Tajo
• Apache Drill
• etc.
AND DON’T FORGET HIVE!

Four possible market outcomes…
• Hadoop and SQL databases are on a collision course – only one will survive
  • No sign of that so far
• They are complementary – both will survive
  • Probably - the challenge is how to make them work together
• They will merge and become one
  • Some indications this is already starting to happen
• Something even more amazing will come along and replace them both
  • Sometimes this happens – Spark?

What do the pundits say?
• Martin Fowler – Thoughtworks
  • The rise of NoSQL databases marks the end of the era of relational database dominance
  • But NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice.
  • The era of Polyglot Persistence has begun - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data
• Emil Eifrem – Neo Technology
  • When evaluating a NoSQL database, it is critical to demand enterprise-readiness. An enterprise delivering modern applications needs a NoSQL database that can manage today's complex and connected data while still delivering the enterprise strength, transactions and durability that IT departments have relied on for years.

Case Study - King
King in numbers
• 100 million daily active users
• 1 billion game plays per day
• 8 offices
And lots and lots of data...
• 14 billion rows per day
• 500 GB of new data per day
• 700 TB stored

King - Getting to know 500 million players
Objectives in game analytics
• Metrics and KPIs
• Measure and understand player behaviour
• Player segmentation
• Improve player experience
• Forecasting
• Predictive modelling

Challenges at King
• Extreme scale
• Rate of growth
• Speed of innovation
• Cross platform
• Virtual economies

The King formula
• Data driven culture
• Engaged business
• Talented embedded data scientists
• A/B testing
• Right technology platform
• Right data model

How King does data
System architecture
[Diagram components: Game servers, Log server, TSV log files, Raw data, ETL, Data Warehouse, Dimensional model, Reports, Data scientists]

Our data keeps growing...
[Chart annotation: King launches on mobile...]

…our technology has to keep up
[Chart annotations: Qlikview says no; Infobright CE says no; 10 node Hadoop; 20 nodes; 40 nodes; 80 nodes; InfiniDB; Exasol]

Data platform 1.0
[Diagram components: Games, Event data, Hive, ETL, Reports, Data scientists]

Data platform 1.5
[Diagram components: Games, Event data, Hive, DB, ETL, Reports, Data scientists]

Why ExaSolution?
• Speed
• Efficiency
• Tuning free
• Scaling (150 TB and counting...)
• ExaDudes

Performance

Data platform 2.0
[Diagram components: Games, Event data, Hive, Exasol, ETL, Reports, Data scientists]

Benefits
• ETL times slashed
• Cost saving
• Tuning free
• Scaling

Where next?
Data platform 3.0
[Diagram components: Games, Event data, Exasol, Hive, ETL, Reports, Data scientists]

Future challenges
• Keep on scaling
• Closer Hadoop integration
• Evolving data model
• Microbatch ETL
• Real(er) time…

What’s hot?

Internet of Things
• A definition:
  • The Internet of Things (IoT) is a scenario in which objects, animals or people are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction
• Basic concept has been around for decades – now accepted into the mainstream
• Wide range of potential uses:
  • Environmental monitoring
  • Infrastructure management
  • Manufacturing
  • Energy management
  • Medical and healthcare systems
  • Building and home automation
  • Transport systems

Other emerging technologies which produce data
• Wearable technologies – e.g. smart watches, Google Glass
• Bio sensors for humans (and other animals)
  • Health monitoring
  • Already in use on some dairy farms – optimise milk yields and give early warning of possible disease
• Location based data
  • All modern phones provide location data (either GPS or cell based)
  • ‘Crowd sourcing’ – e.g. traffic flow based on cellphone signals
  • Beacons – e.g. Regent Street in London
  • Location-based special offers and advertisements
• Facial recognition
  • To drive targeted advertisements

Other emerging trends
• Cloud being used for evaluation of new technologies and also as a platform for dev/test (and even DR) environments
• In-database analytics using UDFs in languages such as R, Lua and Python
  • Move the processing closer to the data
  • Run analytics on full data volumes (no sampling/extract required)
  • Get improved performance due to parallelism (where possible)
  • Lots of freely available R code on the web
• Automated conversion of analytical results to text (NLG) is emerging
  • AI rule-based generation of natural language output
  • Readable summaries and recommendations
  • Yseop, NarrativeScience, Automated Insights, Arria NLG

Summary
• Data and database technology isn’t going away!
• New database approaches are being developed to address the requirements of flexibility, scalability etc.
• These technologies drive an increasing need for more analysts, database designers, data scientists
• Hybrid systems are becoming the norm, with companies mixing ‘best of breed’ technologies (possibly open source) to get the best and most cost-effective results – use ‘the right tool for the job’
• SQL databases will continue to be widely utilised – but alongside other technologies, and integration will become tighter

Any questions?
Dave Shuttleworth
Twitter: @EXA_DaveS
Email: dave.shuttleworth@exasol.com