Splout SQL: A web-latency SQL view for Hadoop

Presentation for Big Data Beers @ Berlin, 12-2012. In this presentation I introduce Splout SQL, a new open-source data view for Hadoop which provides high-throughput, low-latency query times. Through these slides I show a recurring problem we have found when doing Hadoop consulting: moving data between processing and serving. I will show how Splout SQL solves it.

  1. Splout SQL: A web-latency SQL view for Hadoop
     http://sploutsql.com
  2. Who am I?
     ● Pere Ferrera Bertran, Barcelona. @ferrerabertran
     ● 8 years as a “backender” @ BCN startups.
     ● “The shy guy” (aka CTO) @ Datasalt
       ● Hadoop consulting: PeerIndex, Trovit, BBVA
       ● Open-source low-level API for Hadoop (Pangool)
         – Accepted paper: ICDM 2012
     ● Jazz pianist in my free time.
  3. 3.5 Big Data Challenges
     Moving Big Data seamlessly is also a challenge!
  4. Hadoop
     ● Mainstream Big Data storage & batch processing
       ● Open-source.
       ● Large community.
       ● Many higher-level tools.
       ● Many companies around it.
       ● It scales.
     ● Bad things people say about it:
       ● Slow – but we now have MapR!
       ● Hard to program – but we have Hive, Pig or Pangool!
         – And even things like Datameer!
       ● Buggy – but we have a stable 1.0 and supporting companies like Cloudera!
     ● Getting better and better! – YARN (2.0)
  5. The Batch Revolution
     ● Batch is not the only kind of processing
       ● But it covers many cases very well.
         – All our consulting clients use it.
     ● Hadoop makes it transparently scalable!
       ● I see this as a revolution.
     ● Advantages: simple, resistant to programming errors.
     ● Disadvantages: long-running processes, results updated in a matter of hours.
       ● My advice: can you cope with that? Then use batch processing.
     ● Ted Dunning & Nathan Marz are good “gurus” to listen to on this topic.
  6. The problem (we want to solve)
     ● Big Data usually means having Big Data as input.
     ● A lot of emphasis nowadays is on “analytics”, where the output is usually small.
       ● Small, targeted reports.
     “I will eat all this so that almost nothing remains out of it...”
  7. The problem (we want to solve)
     ● But the problem is that sometimes the output is also Big Data!
       ● Recommendations
       ● Aggregated stats
       ● Listings
     ● Recurring problem: take your “Big Data output” and “put it” somewhere
       ● NoSQL
       ● Search engine
     ● ...to be able to answer real-time queries with low-latency lookups over it.
       ● Websites, mobile apps.
       ● A lot of people using the app concurrently.
       ● Read-only!
  8. Current options
     ● Hadoop-generated files are not (directly) queryable.
       ● They lack the appropriate indexes (e.g. B-trees) for making queries really fast.
     ● We can “send” the result of a Hadoop process directly to a database...
     ● Problems:
       – Latency (random writes / rebalancing / index updates)
       – Affecting query service (the database may slow down while updating and serving at the same time)
       – Incrementality (may lead to inconsistent results)
  9. Meet Splout SQL!
     ● Store generation decoupled from store serving
       ● Data is always optimally indexed.
       ● Zero fragmentation.
     ● “Atomic” deployment
       ● New versions are swapped in without affecting query serving.
       ● All data replaced at once.
       ● Flexible.
     ● 100% SQL
       ● Rich query language.
       ● Real-time aggregations over the data (see the query sketch below).
       ● Not everything needs to be pre-computed!
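To make the “real-time aggregations” point concrete, here is a minimal query sketch: the GROUP BY is evaluated by the serving layer at query time instead of being pre-computed in the batch job. The table and column names (visits, category, hits), the partition key, the port and the REST-style endpoint are all assumptions for illustration, not the documented Splout SQL client API.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SploutQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical aggregation computed at query time over one partition.
        String sql = "SELECT category, SUM(hits) FROM visits "
                   + "WHERE customer_id = 42 GROUP BY category";

        // Assumed REST-style endpoint and port; check the Splout SQL docs for the real API.
        String url = "http://localhost:4412/api/query/visits"
                   + "?key=42&sql=" + URLEncoder.encode(sql, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // result rows, served with web latency
    }
}
```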
  10. Details
     ● A very old idea which everyone has implemented by hand at some point:
       ● Horizontal partitioning (sketched below).
     ● Generates many database files (partitions) and distributes them across a cluster.
       ● Replication, fail-over.
     ● Hadoop (Pangool) generates the data structures.
       ● Including all the B-trees needed!
     ● Database files: SQLite files.
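A minimal sketch of the horizontal-partitioning idea, assuming a simple hash of the partition key into N partitions. Splout SQL's real partitioning strategy may differ; the essential contract shown here is only that the batch writer (which builds the SQLite files) and the query router agree on the same key-to-partition function.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: maps a partition key to one of N partitions. */
public class PartitionRouter {
    private final int nPartitions;

    public PartitionRouter(int nPartitions) {
        this.nPartitions = nPartitions;
    }

    /** Hashes a partition key (e.g. a customer id) into a partition number. */
    public int partitionFor(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : bytes) {
            hash = 31 * hash + (b & 0xff); // simple, stable hash
        }
        return Math.floorMod(hash, nPartitions);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(8);
        // Same result on the writer and on the reader, so queries reach the right file.
        System.out.println(router.partitionFor("customer-42"));
    }
}
```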
  11. Did you say SQLite?
  12. SQLite
     ● Fast (about 10% slower than MySQL).
     ● Simple.
     ● Probably the best embedded SQL database out there.
       ● Embedding it makes it easy to use inside Hadoop (see the sketch below).
     ● Still, it lacks some features.
       ● Not the database one would choose for an enterprise app.
     ● But Splout SQL is essentially read-only!
       ● So we don't need that many features.
     Splout != SQLite. In the future we might integrate it with PostgreSQL, for instance.
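To show how little is needed to serve a partition once it exists as an SQLite file, here is a sketch that opens a hypothetical partition file with the xerial sqlite-jdbc driver and runs SQL against it. Splout SQL does this internally; the file path and table name are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Requires the sqlite-jdbc driver (org.xerial:sqlite-jdbc) on the classpath. */
public class ReadPartitionSketch {
    public static void main(String[] args) throws Exception {
        // Each partition is a plain SQLite file produced off-line by the Hadoop job.
        String url = "jdbc:sqlite:/data/partitions/visits-partition-3.db";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, SUM(hits) FROM visits GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```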
  13. Making Splout SQL fly
     ● Because the database is created off-line, things like insertion order can be controlled.
       ● Hadoop sorts the data for you.
     ● So you insert all your data in the appropriate order for making queries fast (sketched below).
       ● Even if disk is used, only one seek is needed (because of data locality).
     Real-time GROUP BYs over an average of 2,000 records of 50 bytes each, in an average of 40 milliseconds, on an m1.small EC2 machine.
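A sketch of the “control the insertion order” idea under stated assumptions: rows are bulk-inserted already sorted by the lookup key inside a single transaction, so each key's rows land on contiguous pages and a query touches very little of the file. In Splout SQL the sorting is done by Hadoop/Pangool at scale; the in-memory list, table and index names here are purely illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.Comparator;
import java.util.List;

public class BuildPartitionSketch {
    record Visit(long customerId, String category, long hits) {}

    public static void main(String[] args) throws Exception {
        List<Visit> rows = List.of(
                new Visit(42, "sports", 10),
                new Visit(7, "news", 3),
                new Visit(42, "news", 5));

        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:/tmp/visits-partition-0.db");
             Statement ddl = conn.createStatement()) {
            ddl.execute("CREATE TABLE visits (customer_id INTEGER, category TEXT, hits INTEGER)");
            conn.setAutoCommit(false); // one bulk transaction: the file is built off-line, no concurrent writers

            try (PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO visits VALUES (?, ?, ?)")) {
                // Insert in lookup-key order so all rows for one key end up physically close together.
                for (Visit v : rows.stream()
                                   .sorted(Comparator.comparingLong(Visit::customerId))
                                   .toList()) {
                    insert.setLong(1, v.customerId());
                    insert.setString(2, v.category());
                    insert.setLong(3, v.hits());
                    insert.executeUpdate();
                }
            }
            ddl.execute("CREATE INDEX idx_customer ON visits(customer_id)");
            conn.commit();
        }
    }
}
```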
  14. Recap
     ● We see a recurring problem when the output is also Big Data:
       ● Moving data between (batch) processing and serving.
     ● Splout SQL solves it and adds full SQL.
     “A web-latency SQL view for Hadoop”
     ● Web-latency: unlike data warehousing / analytics.
     ● SQL: unlike key/value stores and other NoSQL systems.
     ● View: simply makes files queryable → read-only.
     ● For Hadoop: for the Big Data output of batch processing.
     Check it out and play with it! http://sploutsql.com
