Splout SQL: A web-latency SQL view for Hadoop


Presentation for Big Data Beers @ Berlin, 12-2012. In this presentation I introduce Splout SQL, a new open-source data view for Hadoop which provides high-throughput, low-latency query times. Through these slides I show a recurring problem we have found when doing Hadoop consulting: moving data between processing and serving. I will show how Splout SQL solves it.

Published in: Technology


Transcript

  • 1. Splout SQL: A web-latency SQL view for Hadoop. http://sploutsql.com
  • 2. Who am I?
    ● Pere Ferrera Bertran, Barcelona. @ferrerabertran
    ● 8 years as a “backender” at Barcelona startups.
    ● “The shy guy” (aka CTO) at Datasalt.
      ● Hadoop consulting: PeerIndex, Trovit, BBVA.
      ● Open-source low-level API for Hadoop (Pangool). Accepted paper: ICDM 2012.
    ● Jazz pianist in my free time.
  • 3. 3.5 Big Data Challenges. Moving Big Data seamlessly is also a challenge!
  • 4. Hadoop
    ● Mainstream Big Data storage & batch processing:
      ● Open-source.
      ● Large community.
      ● Many higher-level tools.
      ● Many companies around it.
      ● It scales.
    ● Bad things people say about it:
      ● Slow – but we now have MapR!
      ● Hard to program – but we have Hive, Pig or Pangool – and even things like Datameer!
      ● Buggy – but we have a stable 1.0 and supporting companies like Cloudera!
    ● Getting better and better: YARN (2.0).
  • 5. The Batch Revolution
    ● Batch is not the only kind of processing, but it covers many cases very well. All our consulting clients use it.
    ● Hadoop makes it transparently scalable! I see this as a revolution.
    ● Advantages: simple, resistant to programming errors.
    ● Disadvantages: long-running processes; results are updated on a scale of hours.
      ● My advice: can you cope with that? Then use batch processing.
    ● Ted Dunning & Nathan Marz are good “gurus” to hear talk about this.
  • 6. The problem (we want to solve)
    ● Big Data usually means having Big Data as input.
    ● A lot of emphasis nowadays is on “analytics”, where the output is usually small: small, targeted reports. “I will eat all this so that almost nothing remains out of it...”
  • 7. The problem (we want to solve)
    ● But the problem is that sometimes the output is also Big Data!
      ● Recommendations
      ● Aggregated stats
      ● Listings
    ● Recurring problem: take your “Big Data output” and “put it” somewhere:
      ● NoSQL
      ● Search engine
    ● ...to be able to answer real-time queries with low-latency lookups over it.
      ● Websites, mobile apps.
      ● Many people using the app concurrently.
      ● Read-only!
  • 8. Current options
    ● Hadoop-generated files are not (directly) queryable: they lack the appropriate indexes (e.g. B-trees) for making queries really fast.
    ● We can “send” the result of a Hadoop process directly to a database...
    ● Problems:
      – Latency (random writes / rebalancing / index updates).
      – Impact on query serving (the database may slow down while updating and serving at the same time).
      – Incrementality (may lead to inconsistent results).
  • 9. Meet Splout SQL!
    ● Store generation decoupled from store serving:
      ● Data is always optimally indexed.
      ● Zero fragmentation.
    ● “Atomic” deployment:
      ● New versions are swapped in without affecting query serving.
      ● All data is replaced at once.
      ● Flexible.
    ● 100% SQL:
      ● Rich query language.
      ● Real-time aggregations over the data.
      ● Not everything needs to be pre-computed!
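The “atomic deployment” idea on this slide can be sketched in a few lines. This is an illustrative, simplified version (directory layout, function names, and the symlink trick are my assumptions, not Splout SQL's actual implementation): build each new version's files in their own directory, then flip a single pointer so readers switch to the new version all at once.

```python
import os

def deploy_version(data_dir, current_link, version, files_writer):
    """Hedged sketch of "atomic" deployment: build the new version's files
    in a fresh directory, then atomically repoint a symlink at it.
    Readers following the symlink never see a half-deployed state."""
    version_dir = os.path.join(data_dir, f"v{version}")
    os.makedirs(version_dir, exist_ok=True)
    files_writer(version_dir)            # e.g. copy the new partition files in
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(version_dir, tmp_link)
    os.replace(tmp_link, current_link)   # rename(2) is atomic on POSIX
```

Because the old version's directory is untouched until the pointer flips, queries keep being served from it during the whole generation phase, which is exactly the decoupling the slide describes.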
  • 10. Details
    ● A very old idea which everyone has implemented by hand at some point: horizontal partitioning.
    ● Generates many database files (partitions) and distributes them in a cluster: replication, fail-over.
    ● Hadoop (Pangool) is used for generating the data structures, including all the B-trees needed.
    ● Database files: SQLite files.
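The horizontal-partitioning step above can be sketched as follows. This is a minimal single-process illustration (the table schema, file naming, and hash function are assumptions for the example; in Splout SQL the partition files are generated by a Hadoop job, not locally): hash each row's partition key, route it to one of N SQLite files, and build the index once the data is in.

```python
import sqlite3
import zlib

def build_partitions(rows, key_fn, n_partitions, path_prefix="partition"):
    """Hash-partition (key, value) rows into n_partitions SQLite files.
    The same key always lands in the same partition file, so a query
    router can locate the right file from the key alone."""
    conns = []
    for p in range(n_partitions):
        conn = sqlite3.connect(f"{path_prefix}-{p}.db")
        conn.execute("CREATE TABLE IF NOT EXISTS data (key TEXT, value INTEGER)")
        conns.append(conn)
    for row in rows:
        # crc32 is a stable hash, unlike Python's salted built-in hash()
        p = zlib.crc32(key_fn(row).encode("utf-8")) % n_partitions
        conns[p].execute("INSERT INTO data VALUES (?, ?)", row)
    for conn in conns:
        # the B-tree is built at generation time, never while serving
        conn.execute("CREATE INDEX IF NOT EXISTS idx_key ON data(key)")
        conn.commit()
        conn.close()
```

A serving node would then apply the same `crc32(key) % n_partitions` routing to pick which partition file to query.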
  • 11. Did you say SQLite?
  • 12. SQLite
    ● Fast (about 10% slower than MySQL).
    ● Simple.
    ● Probably the best embedded SQL engine out there: embedding it makes it easy to use inside Hadoop.
    ● Still, it lacks some features: not the database one would choose for an enterprise app.
    ● But Splout SQL is essentially read-only, so we don't need that many features.
    ● Splout != SQLite. In the future we might integrate it with PostgreSQL, for instance.
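The read-only serving model on this slide is easy to demonstrate with SQLite itself. A minimal sketch (the `hits` table and function names are invented for the example; Splout SQL's own generation runs as a Hadoop job): the file is written once at generation time, and the serving side opens it strictly read-only via a SQLite URI, so writes are rejected by the engine.

```python
import sqlite3

def build_view(path, rows):
    """One-off view generation (write phase, decoupled from serving)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE hits (page TEXT, n INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

def serve_total(path):
    """Serving side: mode=ro makes SQLite refuse any write on this handle."""
    ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        return ro.execute("SELECT SUM(n) FROM hits").fetchone()[0]
    finally:
        ro.close()
```

Any `INSERT`/`UPDATE` attempted through the read-only handle raises `sqlite3.OperationalError`, which is why the missing write-oriented features matter so little here.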
  • 13. Making Splout SQL fly
    ● Because the database is created offline, things like insertion order can be controlled: Hadoop sorts the data for you.
    ● So you insert all your data in the appropriate order for making queries fast: even if disk is used, only one seek is needed (because of data locality).
    ● Real-time GROUP BYs touching an average of 2,000 records of 50 bytes in 40 milliseconds on average, on an m1.small EC2 machine.
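The controlled-insertion-order trick can be sketched like this (schema and function names are illustrative, and the sorting is done in-process here, whereas the slide's point is that Hadoop does it at scale): rows are sorted by the query key before the bulk load, so rows for the same key sit physically adjacent, and the index is built once afterwards instead of being maintained per insert.

```python
import sqlite3

def load_sorted(conn, records):
    """Bulk-load (client, amount) records pre-sorted by the query key.
    Sorted insertion gives data locality on disk; building the B-tree
    after the load is far cheaper than updating it row by row."""
    conn.execute("CREATE TABLE stats (client TEXT, amount INTEGER)")
    records.sort(key=lambda r: r[0])   # the "Hadoop sorts the data for you" step
    conn.executemany("INSERT INTO stats VALUES (?, ?)", records)
    conn.execute("CREATE INDEX idx_client ON stats(client)")
    conn.commit()

def realtime_group_by(conn, client):
    """The kind of real-time aggregation the slide benchmarks."""
    return conn.execute(
        "SELECT SUM(amount), COUNT(*) FROM stats WHERE client = ?", (client,)
    ).fetchone()
```

With locality, scanning all rows for one client costs roughly one seek plus a sequential read, which is what makes millisecond-scale GROUP BYs plausible on modest hardware.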
  • 14. Recap
    ● We see a recurring problem when the output is also Big Data: moving data between (batch) processing and serving.
    ● Splout SQL solves it and adds full SQL: “A web-latency SQL view for Hadoop.”
      ● Web-latency: unlike data warehousing / analytics.
      ● SQL: unlike key/value and other NoSQLs.
      ● View: simply makes files queryable → read-only.
      ● For Hadoop: for the Big Data output of batch processing.
    ● Check it out and play with it! http://sploutsql.com
