Splout SQL: A web-latency SQL view for Hadoop
Presentation for Big Data Beers @ Berlin, 12-2012. In this presentation I introduce Splout SQL, a new open-source data view for Hadoop which provides high-throughput, low-latency query times. Through these slides I show a recurring problem we have found when doing Hadoop consulting: moving data between processing and serving. I will show how Splout SQL solves it.



Usage Rights

© All Rights Reserved


    Presentation Transcript

    • Splout SQL
      A web-latency SQL view for Hadoop
      http://sploutsql.com
    • Who am I?
      ● Pere Ferrera Bertran, Barcelona @ferrerabertran
      ● 8 years “backender” @ BCN startups.
      ● “The shy guy” (aka CTO) @ Datasalt
        ● Hadoop consulting: PeerIndex, Trovit, BBVA
        ● Open-source low-level API for Hadoop (Pangool)
          – Accepted paper: ICDM 2012
      ● Jazz pianist in my free time
    • 3.5 Big Data Challenges
      Moving Big Data seamlessly is also a challenge!
    • Hadoop
      ● Mainstream Big Data storage & batch processing
        ● Open-source.
        ● Large community.
        ● Many higher-level tools.
        ● Many companies around.
        ● It scales.
      ● Bad things people say about it:
        ● Slow – but we now have MapR!
        ● Hard to program – but we have Hive, Pig or Pangool!
          – and even things like Datameer!
        ● Buggy – but we have a stable 1.0 and supporting companies like Cloudera!
      ● Getting better and better! – YARN (2.0)
    • The Batch Revolution
      ● Batch is not the only kind of processing
        ● But it covers many cases very well. All our consulting clients use it.
      ● Hadoop makes it transparently scalable!
        ● I see this as a revolution.
      ● Advantages: simple, resistant to programming errors.
      ● Disadvantages: long-running processes, results updated on a scale of hours.
        ● My advice: can you cope with that? Then use batch processing.
      ● Ted Dunning & Nathan Marz are good “gurus” to hear talk about this.
    • The problem (we want to solve)
      ● Big Data usually means having Big Data as input
      ● A lot of emphasis nowadays on “analytics”, where the output is usually small
        ● Small, targeted reports. “I will eat all this so that almost nothing remains out of it...”
    • The problem (we want to solve)
      ● But sometimes the output is also Big Data!
        ● Recommendations
        ● Aggregated stats
        ● Listings
      ● Recurring problem: take your “Big Data output” and “put it” somewhere
        ● NoSQL
        ● Search engine
      ● ...to be able to answer real-time queries with low-latency lookups over it.
        ● Websites, mobile apps.
        ● Many people using the app concurrently.
        ● Read-only!
    • Current options
      ● Hadoop-generated files are not (directly) queryable
        ● They lack the appropriate indexes (e.g. b-trees) to make queries really fast
      ● We can “send” the result of a Hadoop process directly to a database...
      ● Problems:
        – Latency (random writes / rebalancing / index updates)
        – Affecting query service (the database may slow down while updating and serving at the same time)
        – Incrementality (may lead to inconsistent results)
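The “missing indexes” point can be seen directly in SQLite. A minimal sketch (illustrative only, not Splout code): the same lookup does a full table scan on an unindexed column, and a direct b-tree search once an index exists. The table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (url TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("/page/%d" % i, i % 7) for i in range(1000)])

# Without an index: SQLite must scan every row.
plan1 = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM hits WHERE url = '/page/42'"
).fetchone()[-1]
print(plan1)  # e.g. "SCAN hits" (wording varies by SQLite version)

# With a b-tree index: a direct search on the key.
conn.execute("CREATE INDEX idx_url ON hits(url)")
plan2 = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM hits WHERE url = '/page/42'"
).fetchone()[-1]
print(plan2)  # e.g. "SEARCH hits USING INDEX idx_url (url=?)"
```

This is exactly what a flat Hadoop output file cannot do: without the pre-built b-tree, every query pays the scan.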
    • Meet Splout SQL!
      ● Store generation decoupled from store serving
        ● Data is always optimally indexed.
        ● Zero fragmentation.
      ● “Atomic” deployment
        ● New versions replaced without affecting query serving.
        ● All data replaced at once.
        ● Flexible.
      ● 100% SQL
        ● Rich query language
        ● Real-time aggregations over data
        ● Not everything needs to be pre-computed!
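A hypothetical sketch of the “atomic deployment” idea (the file names and layout here are mine, not Splout’s internals): readers always follow a `CURRENT` pointer to a fully built version, and switching versions is a single atomic rename, so queries never see a half-written store.

```python
import os
import tempfile

def deploy(base, version_dir):
    """Point base/CURRENT at a new, fully built version directory."""
    fd, tmp = tempfile.mkstemp(dir=base)
    with os.fdopen(fd, "w") as f:
        f.write(version_dir)
    # os.replace is atomic on POSIX: a concurrent reader sees either
    # the old pointer or the new one, never a partial file.
    os.replace(tmp, os.path.join(base, "CURRENT"))

def current(base):
    with open(os.path.join(base, "CURRENT")) as f:
        return f.read()

base = tempfile.mkdtemp()
deploy(base, "v1")   # first version goes live
deploy(base, "v2")   # all data replaced at once
print(current(base))  # v2
```

The point is the decoupling: building `v2` can take as long as a Hadoop job needs, and serving only ever flips one pointer.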
    • Details
      ● A very old idea which everyone has implemented by hand at some point:
        ● Horizontal partitioning.
      ● Generates many database files (partitions) and distributes them across a cluster.
        ● Replication, fail-over.
      ● Hadoop (Pangool) generates the data structures.
        ● Including all the b-trees needed!
      ● Database files: SQLite files.
    • Did you say SQLite?
    • SQLite
      ● Fast (10% slower than MySQL)
      ● Simple.
      ● Probably the best embedded SQL engine out there.
        ● Embedding it makes it easy to use inside Hadoop.
      ● Still, it lacks some features.
        ● Not the database one would choose for an enterprise app.
      ● But Splout SQL is essentially read-only!
        ● So we don't need that many features.
      Splout != SQLite. In the future we might integrate it with PostgreSQL, for instance.
    • Making Splout SQL fly
      ● Because the database is created off-line, things like insertion order can be controlled.
        ● Hadoop sorts the data for you.
      ● So you insert all your data in the appropriate order to make queries fast.
        ● Even if disk is used, only one seek will be needed (because of data locality).
      Real-time GROUP BYs over avg. 2,000 records of 50 bytes in an average of 40 milliseconds on an m1.small EC2 machine.
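The “insert in query order” idea can be sketched as follows (a toy schema of my own invention, standing in for the Hadoop-sorted output): rows are sorted on the filter column before loading, so all rows a query touches sit contiguously, and the aggregation itself runs at query time rather than being pre-computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (site TEXT, day TEXT, hits INTEGER)")
conn.execute("CREATE INDEX idx_site ON events(site)")

raw = [("b.com", "2012-12-01", 3), ("a.com", "2012-12-01", 5),
       ("a.com", "2012-12-02", 2), ("b.com", "2012-12-02", 1)]
# In Splout's setting, Hadoop performs this sort at generation time,
# so rows for the same site land next to each other on disk.
for row in sorted(raw):
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", row)

# Real-time aggregation at query time: nothing pre-computed.
result = conn.execute(
    "SELECT day, SUM(hits) FROM events WHERE site = 'a.com' "
    "GROUP BY day ORDER BY day").fetchall()
print(result)  # [('2012-12-01', 5), ('2012-12-02', 2)]
```

Because the matching rows are physically adjacent, even a cold read needs only one seek before scanning a contiguous range.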
    • Recap
      ● We see a recurring problem when the output is also Big Data:
        ● Moving data between (batch) processing and serving.
      ● Splout SQL solves it and adds full SQL.
      “A web-latency SQL view for Hadoop”
      ● Web-latency: unlike data warehousing / analytics
      ● SQL: unlike key/value and other NoSQLs
      ● View: simply makes files queryable → read-only
      ● For Hadoop: for the Big Data output of batch processing
      Check it out and play with it! http://sploutsql.com