Splout SQL - Web latency SQL views for Hadoop

There are many Big Data problems whose output is also Big Data. In this presentation we will show Splout SQL, which allows serving an arbitrarily big dataset by partitioning it. Splout serves partitioned SQL views which are generated and indexed by Hadoop. Splout is to Hadoop + SQL what Voldemort or ElephantDB are to Hadoop + Key/Value. Hadoop is nowadays the de facto open-source solution for Big Data batch processing. When the output of a Hadoop process is big, there isn't a satisfying solution for serving it. Think of pre-computed recommendations, for example, where the whole dataset may vary from one day to another. Splout decouples database creation from database serving and makes it efficient and safe to deploy Hadoop-generated datasets. There are many databases that allow serving Big Data, such as NoSQL solutions, but they don't have a rich query language like SQL. You generally can't aggregate data in real time like you would do with a GROUP BY clause. Because you can't precompute everything, SQL is a very convenient feature to have in a Big Data serving solution. Splout is not a “fast analytics” engine. Splout is made for demanding web or mobile applications where query performance is critical. Arbitrary real-time aggregations should be done in less than 200 milliseconds under high traffic load. On top of that, Splout is scalable, flexible, RESTful & open-source.
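
To make the GROUP BY point concrete, here is a minimal sketch of a query-time aggregation over one pre-built, indexed SQLite database. It uses Python's standard `sqlite3` module as a stand-in for a single partition. The `pageviews` table, its columns, the index name, the sample rows and the `hits_per_day` helper are all assumptions made up for illustration; none of them come from Splout SQL itself.

```python
import sqlite3

# Stand-in for one partition file. The table, columns and rows below are
# hypothetical and only illustrate a GROUP BY computed at query time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pageviews (domain TEXT, day TEXT, browser TEXT, hits INTEGER)"
)
# Index created at build time so per-domain lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_domain_day ON pageviews (domain, day)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?, ?)",
    [
        ("example.com", "2013-03-01", "firefox", 10),
        ("example.com", "2013-03-01", "chrome", 25),
        ("example.com", "2013-03-02", "chrome", 7),
        ("other.org", "2013-03-01", "chrome", 3),
    ],
)


def hits_per_day(domain):
    """Aggregate hits per day for one domain when the query arrives."""
    cur = conn.execute(
        "SELECT day, SUM(hits) FROM pageviews "
        "WHERE domain = ? GROUP BY day ORDER BY day",
        (domain,),
    )
    return cur.fetchall()


print(hits_per_day("example.com"))
```

The per-day totals are not precomputed. Only the rows and the index are stored, so a different breakdown (for example by browser) would need only a different query, not a new batch job.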

Published in: Technology

Transcript

  • 1. Iván de Prado Alonso – CEO of Datasalt, www.datasalt.es, @ivanprado, @datasalt. Splout SQL: When Big Data Output is also Big Data
  • 2. Full SQL* (unlike NoSQL). For Big Data (unlike RDBMS). Web latency & throughput (unlike Impala, Apache Drill, etc.). * Within each partition
  • 3. How does it work? Isolation between generation and serving
  • 4. Generation: Generate tablespace CLIENTS_INFO with table CLIENTS partitioned by CID and table SALES partitioned by CID. Table CLIENTS (CID, Name): (U20, Doug), (U21, Ted), (U40, John). Table SALES (SID, CID, Amount): (S100, U20, 102), (S101, U20, 60), (S223, U40, 99). Tablespace CLIENTS_INFO, partition U10 – U35: CLIENTS (U20, Doug), (U21, Ted); SALES (S100, U20, 102), (S101, U20, 60). Partition U36 – U60: CLIENTS (U40, John); SALES (S223, U40, 99).
  • 5. Serving: For key = 'U20', tablespace = 'CLIENTS_INFO': SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND c.CID = 'U20'; Partition U10 – U35: CLIENTS (U20, Doug), (U21, Ted); SALES (S100, U20, 102), (S101, U20, 60). Partition U36 – U60: CLIENTS (U40, John); SALES (S223, U40, 99).
  • 6. Serving: For key = 'U40', tablespace = 'CLIENTS_INFO': SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND c.CID = 'U40'; Partition U10 – U35: CLIENTS (U20, Doug), (U21, Ted); SALES (S100, U20, 102), (S101, U20, 60). Partition U36 – U60: CLIENTS (U40, John); SALES (S223, U40, 99).
  • 7. Why does it scale? Data is partitioned. Partitions are distributed across nodes. Adding more nodes increases capacity. Queries are restricted to a single partition. Generation does not impact serving.
  • 8. Ok, so what is Splout SQL useful for?
  • 9. Big Data Analytics: Manageable output
  • 10. Big Data Analytics: Sometimes Big Data output is also Big Data
  • 11. Splout SQL allows serving Big Data results
  • 12. Let’s see an example …
  • 13. Building a Google Analytics: Imagine that one crazy day you decide to build some kind of Google Analytics… Zillions of events. Millions of domains. Individual panel per domain.
  • 14. Requirements: Time-based charts (day/hour aggregations). Flexible dimension breakdown: per page, per browser, per country, per language …
  • 15. With Splout SQL
  • 16. Splout SQL provides SQL consolidated views for Hadoop data
  • 17. Let’s see more details about Splout SQL
  • 18. Splout SQL Architecture
  • 19. Each partition is … Backed by SQLite. Generated on Hadoop, including any indexes needed; data can be sorted before insertion to minimize disk seeks at query time; pre-sampling for balancing partition size. Distributed on the Splout SQL cluster, with replication for failover.
  • 20. Atomicity: A tablespace is a set of tables that share the same partitioning schema. Tablespaces are versioned; only one version is served at a time. Several tablespaces can be deployed at once, with all-or-nothing semantics (atomicity). Rollback support.
  • 21. Characteristics: Ensured ms latencies, even when queries hit disk. Controlled by the developer selecting the proper cluster topology, partitioning, indexes, and data collocation (insertion order).
  • 22. Characteristics (II): 100% SQL, but restricted to a single partition: real-time aggregations, joins. Scalability in data capacity and in performance.
  • 23. Characteristics (III): Atomicity: new data replaces old data all at once. High availability through the use of replication. Open source.
  • 24. Characteristics (IV): Easy to manage: changing the size of the cluster can be done without any downtime. Read only: data is updated in batches; updates come from new tablespace deployments.
  • 25. Characteristics (V): Native connectors: Hive, Pig, Cascading.
  • 26. API - Generation: Command line (loading CSV files): $ hadoop jar splout-*-hadoop.jar generate … Java API. Connectors.
  • 27. API - Service: REST API. JSON response.
  • 28. API - Console
  • 29. Benchmark: 350 GB of Wikipedia logs. Aggregation queries impacting 15 rows on average. 2-machine cluster: 900 queries/second, 80 ms/query, 80 threads.
  • 30. Benchmark (II): 4-machine cluster: 3150 queries/second, 40 ms/query, 160 threads. More info: http://sploutsql.com/performance.html
  • 31. Web latency SQL Consolidated Views For Hadoop. “A good candidate for the serving layer of a lambda architecture”
  • 32. www.SploutCloud.com - Splout SQL as a service
  • 33. Future work: Growing the community (do you want to collaborate?). Automatic rebalancing on failover (almost done). Some read/write capabilities, enabling Splout SQL to become the speed layer on lambda architectures.
  • 34. Iván de Prado Alonso – CEO of Datasalt, www.datasalt.es, @ivanprado, @datasalt. Questions?
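
The generation and serving slides (4 to 6) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Splout SQL's implementation or API. The two "partitions" are in-memory SQLite databases, and the helper names `partition_for`, `generate` and `query` are hypothetical. The rows and the key ranges U10 – U35 and U36 – U60 are taken from the slides. The key filter is written as `c.CID = ?` because an unqualified `CID` is ambiguous in a join over both tables.

```python
import sqlite3
from bisect import bisect_left

# Sample rows from slide 4.
CLIENTS = [("U20", "Doug"), ("U21", "Ted"), ("U40", "John")]
SALES = [("S100", "U20", 102), ("S101", "U20", 60), ("S223", "U40", 99)]

# Inclusive upper bound of each partition's key range: U10-U35 and U36-U60.
UPPER_BOUNDS = ["U35", "U60"]


def partition_for(key):
    """Return the index of the partition whose key range contains `key`."""
    idx = bisect_left(UPPER_BOUNDS, key)
    if key < "U10" or idx == len(UPPER_BOUNDS):
        raise KeyError(key)
    return idx


def generate():
    """Generation phase: build one SQLite database per partition."""
    partitions = []
    for _ in UPPER_BOUNDS:
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE CLIENTS (CID TEXT, Name TEXT)")
        db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")
        partitions.append(db)
    # Both tables are partitioned by CID, so rows that join on CID end up
    # in the same partition.
    for cid, name in CLIENTS:
        partitions[partition_for(cid)].execute(
            "INSERT INTO CLIENTS VALUES (?, ?)", (cid, name))
    for sid, cid, amount in SALES:
        partitions[partition_for(cid)].execute(
            "INSERT INTO SALES VALUES (?, ?, ?)", (sid, cid, amount))
    return partitions


def query(partitions, key, sql, params=()):
    """Serving phase: route by key, then run the SQL on that partition only."""
    return partitions[partition_for(key)].execute(sql, params).fetchall()


SQL = ("SELECT Name, sum(Amount) FROM CLIENTS c, SALES s "
       "WHERE c.CID = s.CID AND c.CID = ?")

parts = generate()
print(query(parts, "U20", SQL, ("U20",)))  # served by partition U10-U35
print(query(parts, "U40", SQL, ("U40",)))  # served by partition U36-U60
```

Because each query touches exactly one partition, the join and the aggregation stay local to one SQLite database. This is the property the "Why does it scale?" slide relies on.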
