Splout SQL: A web-latency SQL view for Hadoop
Presentation for Big Data Beers @ Berlin, 12-2012. In this presentation I introduce Splout SQL, a new open-source data view for Hadoop which provides high-throughput, low-latency query times. Through these slides I show a recurring problem we have found when doing Hadoop consulting: moving data between processing and serving. I will show how Splout SQL solves it.

  • 1. Splout SQL: A web-latency SQL view for Hadoop
  • 2. Who am I?
    ● Pere Ferrera Bertran, Barcelona. @ferrerabertran
    ● 8 years "backender" @ BCN startups.
    ● "The shy guy" (aka CTO) @ Datasalt
      ● Hadoop consulting: PeerIndex, Trovit, BBVA
      ● Open-source low-level API for Hadoop (Pangool)
        – Accepted paper: ICDM 2012
    ● Jazz pianist in my free time.
  • 3. 3.5 Big Data Challenges
    Moving Big Data seamlessly is also a challenge!
  • 4. Hadoop
    ● Mainstream Big Data storage & batch processing:
      ● Open-source.
      ● Large community.
      ● Many higher-level tools.
      ● Many companies around it.
      ● It scales.
    ● Bad things people say about it:
      ● Slow – but we now have MapR!
      ● Hard to program – but we have Hive, Pig or Pangool! – and even things like Datameer!
      ● Buggy – but we have a stable 1.0 and supporting companies like Cloudera!
    ● Getting better and better! – YARN (2.0)
  • 5. The Batch Revolution
    ● Batch is not the only kind of processing,
      ● but it covers many cases very well. All our consulting clients use it.
    ● Hadoop makes it transparently scalable!
      ● I see this as a revolution.
    ● Advantages: simple, resistant to programming errors.
    ● Disadvantages: long-running processes; results updated in hours.
      ● My advice: can you cope with that? Then use batch processing.
    ● Ted Dunning & Nathan Marz are good "gurus" to hear talk about this.
  • 6. The problem (we want to solve)
    ● Big Data usually means having Big Data as input.
    ● A lot of emphasis nowadays on "analytics", where the output is usually small:
      ● Small, targeted reports. "I will eat all this so that almost nothing remains out of it..."
  • 7. The problem (we want to solve)
    ● But sometimes the output is also Big Data!
      ● Recommendations
      ● Aggregated stats
      ● Listings
    ● Recurring problem: take your "Big Data output" and "put it" somewhere:
      ● NoSQL
      ● Search engine
    ● ...so you can answer real-time queries with low-latency lookups over it:
      ● Websites, mobile apps.
      ● Many people using the app concurrently.
      ● Read-only!
  • 8. Current options
    ● Hadoop-generated files are not (directly) queryable:
      ● They lack appropriate indexes (e.g. B-trees) for making queries really fast.
    ● We can "send" the result of a Hadoop process directly to a database...
    ● Problems:
      – Latency (random writes / rebalancing / index updates)
      – Affecting query service (the database may slow down while updating and serving at the same time)
      – Incrementality (may lead to inconsistent results)
  • 9. Meet Splout SQL!
    ● Store generation decoupled from store serving:
      ● Data is always optimally indexed.
      ● Zero fragmentation.
    ● "Atomic" deployment:
      ● New versions are swapped in without affecting query serving.
      ● All data replaced at once.
      ● Flexible.
    ● 100% SQL:
      ● Rich query language.
      ● Real-time aggregations over data.
      ● Not everything needs to be pre-computed!
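    The "atomic deployment" idea above can be sketched in miniature: readers always go through one reference to the current version, and a new version (built completely offline) is published with a single pointer flip. This is a conceptual Python sketch, not Splout SQL's actual implementation; the `VersionedStore` class and its data are hypothetical.

    ```python
    import threading

    class VersionedStore:
        """Toy model of atomic version deployment: queries read an
        immutable snapshot; deploys swap the snapshot in one step."""

        def __init__(self, data):
            self._current = data          # immutable snapshot (version 1)
            self._lock = threading.Lock()

        def deploy(self, new_data):
            # The new version is fully built before this call; flipping
            # the pointer is the only "online" step, so readers never
            # observe a half-updated store.
            with self._lock:
                self._current = new_data

        def query(self, key):
            return self._current.get(key)

    store = VersionedStore({"user1": 10})
    print(store.query("user1"))   # 10
    store.deploy({"user1": 12, "user2": 3})  # atomic swap to version 2
    print(store.query("user1"))   # 12
    ```

    In Splout SQL the "snapshot" is a set of pre-built SQLite files rather than an in-memory dict, but the serving property is the same: updates never mutate data that is being queried.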
  • 10. Details
    ● A very old idea which everyone implemented by hand at some point:
      ● Horizontal partitioning.
    ● Generates many database files (partitions) and distributes them across a cluster:
      ● Replication, fail-over.
    ● Hadoop (Pangool) for generating the data structures,
      ● including all B-trees needed!
    ● Database files: SQLite files.
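    The horizontal-partitioning idea can be sketched with plain SQLite: hash a partition key to pick one of N database files, so a keyed query only ever touches one partition. This is a minimal illustration, not Splout SQL's code; the table name, key column, and partition count are made up, and in-memory databases stand in for the per-partition SQLite files.

    ```python
    import sqlite3
    import zlib

    N_PARTITIONS = 4  # hypothetical cluster size

    def partition_for(key: str, n_partitions: int = N_PARTITIONS) -> int:
        """Route a partition key to one of N database files (the same
        key always lands in the same partition)."""
        return zlib.crc32(key.encode("utf-8")) % n_partitions

    # Each partition is an ordinary SQLite database; in-memory here
    # to keep the sketch self-contained.
    partitions = [sqlite3.connect(":memory:") for _ in range(N_PARTITIONS)]
    for db in partitions:
        db.execute("CREATE TABLE pageviews (url TEXT, day TEXT, views INTEGER)")

    rows = [("a.com", "2012-12-01", 10), ("b.com", "2012-12-01", 7)]
    for url, day, views in rows:
        partitions[partition_for(url)].execute(
            "INSERT INTO pageviews VALUES (?, ?, ?)", (url, day, views))

    # A query for one key only needs the one partition that holds it:
    db = partitions[partition_for("a.com")]
    total = db.execute(
        "SELECT SUM(views) FROM pageviews WHERE url = ?", ("a.com",)).fetchone()[0]
    print(total)  # 10
    ```

    In the real system the partition files are generated by a Hadoop job and replicated across serving nodes; the routing step is the part sketched here.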
  • 11. Did you say SQLite?
  • 12. SQLite
    ● Fast (10% slower than MySQL).
    ● Simple.
    ● Probably the best embedded SQL out there:
      ● Embedding it makes it easy to use inside Hadoop.
    ● Still, it lacks some features:
      ● Not the database one would choose for an enterprise app.
    ● But Splout SQL is essentially read-only!
      ● So we don't need that many features. Splout != SQLite. In the future we might integrate it with PostgreSQL, for instance.
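    The read-only point is easy to demonstrate with standard SQLite: a file built offline can be served with writes disabled at the connection level. A small sketch using Python's `sqlite3` (file path and table are made up for illustration):

    ```python
    import os
    import sqlite3
    import tempfile

    # Build a database file up front (in Splout SQL this step is done
    # offline by a Hadoop job, not by the serving layer).
    path = os.path.join(tempfile.mkdtemp(), "partition0.db")
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE t (k TEXT PRIMARY KEY, v INTEGER)")
    db.execute("INSERT INTO t VALUES ('a', 1)")
    db.commit()
    db.close()

    # Serve it strictly read-only: SQLite's mode=ro URI flag makes the
    # connection reject any write, matching a read-only serving model.
    ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    print(ro.execute("SELECT v FROM t WHERE k = 'a'").fetchone()[0])  # 1
    try:
        ro.execute("INSERT INTO t VALUES ('b', 2)")
    except sqlite3.OperationalError as e:
        print("write rejected:", e)
    ```

    Separating "build" from "serve" like this is what lets Splout SQL skip most of the features a writable enterprise database needs.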
  • 13. Making Splout SQL fly
    ● Because the database is created offline, things like insertion order can be controlled:
      ● Hadoop sorts the data for you.
    ● So you insert all your data in the appropriate order for making queries fast:
      ● Even if disk is used, only one seek will be needed (because of data locality).
    Real-time GROUP BYs over an average of 2000 records of 50 bytes, in an average of 40 milliseconds, on an m1.small EC2 machine.
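    The insertion-order trick can be sketched in a few lines: sort the rows by the query key before bulk-inserting (as the offline Hadoop job would), so rows for one key sit contiguously, and build the index after loading rather than maintaining it per row. The table, key names, and data below are invented for the example.

    ```python
    import random
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (shop TEXT, amount INTEGER)")

    # Simulate what the offline job does: sort rows by the query key
    # before inserting, so each shop's rows are contiguous and a range
    # scan needs at most one seek on disk.
    rows = [(f"shop{random.randrange(100)}", random.randrange(50))
            for _ in range(2000)]
    rows.sort(key=lambda r: r[0])
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    # Building the B-tree index after the bulk insert is cheaper than
    # updating it on every row.
    db.execute("CREATE INDEX idx_shop ON sales (shop)")

    # A real-time aggregation over the pre-sorted, pre-indexed data:
    result = db.execute(
        "SELECT shop, SUM(amount) FROM sales WHERE shop = 'shop7' GROUP BY shop"
    ).fetchall()
    print(result)
    ```

    The quoted 40 ms figure is for Splout SQL's own setup on EC2, not for this sketch; the point here is only the load-sorted-then-index pattern.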
  • 14. Recap
    ● We see a recurring problem when the output is also Big Data:
      ● moving data between (batch) processing and serving.
    ● Splout SQL solves it and adds full SQL.
    "A web-latency SQL view for Hadoop"
    ● Web-latency: unlike data warehousing / analytics.
    ● SQL: unlike key/value and other NoSQL stores.
    ● View: simply makes files queryable → read-only.
    ● For Hadoop: for the Big Data output of batch processing.
    Check it out and play with it!