Speech given by Monica Franceschini, Solution Architecture Manager at the Big Data Competency Center of Engineering Group, on the occasion of the Data Driven Innovation Rome 2016 - Open Summit.
10. Moreover…
• Adoption of a well-established solution
• Availability of support services
• Community, open source or … free version!
11. Hadoop storage: HDFS vs HBase

HDFS
• Large data sets
• Unstructured data
• Write-once-read-many access
• Append-only file system
• Hive HQL access
• Fault-tolerant
• Replication

HBase
• High-speed writes and scans
• Many rows/columns
• Compaction
• Random read-writes
• Updates
• Rowkey access
• Data modeling
• NoSQL
• Untyped data
• Sparse schema
• High throughput
• Variable columns
14. Some HBase features:
• Just one index (the primary key)
• Rowkey composed of other fields
• Big denormalized tables
• Rowkey-based horizontal partitioning
• Focus on rowkey design and table schema (data modeling)
• The ACCESS PATTERN must be known in advance!
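Since HBase offers only one index, the rowkey is typically composed from the fields that the known access pattern filters on. A minimal sketch in plain Java (no HBase dependency; field names and the "latest events per customer" access pattern are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: composing an HBase rowkey from several fields so that the
// expected access pattern (here: "latest events per customer") maps to
// a contiguous, well-ordered key range.
public class RowkeyDesign {
    // Fixed-width customer id + reversed timestamp: rows for one customer
    // sort together, newest first (HBase sorts rowkeys lexicographically).
    static byte[] rowkey(String customerId, long epochMillis) {
        byte[] id = String.format("%-12s", customerId).getBytes(StandardCharsets.UTF_8);
        long reversed = Long.MAX_VALUE - epochMillis;   // newest sorts first
        return ByteBuffer.allocate(id.length + Long.BYTES)
                         .put(id)
                         .putLong(reversed)
                         .array();
    }

    public static void main(String[] args) {
        byte[] older = rowkey("cust-42", 1_000L);
        byte[] newer = rowkey("cust-42", 2_000L);
        // Lexicographically, the newer event must sort before the older one.
        System.out.println(java.util.Arrays.compare(newer, older) < 0);  // true
    }
}
```

The fixed-width padding matters: variable-length fields at the front of a composite rowkey would break the lexicographic ordering the access pattern relies on.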
18. • Phoenix is fast: a full table scan of 100M rows is usually executed in 20 seconds (narrow table on a medium-sized cluster). This comes down to a few milliseconds if the query contains a filter on key columns.
• Phoenix follows the philosophy of bringing the computation to the data by using:
• coprocessors to perform operations on the server side, thus minimizing client/server data transfer
• custom filters to prune data as close to the source as possible. In addition, Phoenix uses native HBase APIs to minimize any startup costs.
Query chunks: Phoenix chunks up your query using the region boundaries and runs the chunks in parallel on the client using a configurable number of threads.
The aggregation is done in a coprocessor on the server side.
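The chunk-and-merge idea above can be sketched in plain Java: split a key range at (hypothetical) region boundaries, count matching rows per chunk on a thread pool, then sum the partial counts on the client. This only illustrates the parallelization pattern, not Phoenix's actual internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of Phoenix-style parallel aggregation: one "scan" per region
// chunk, executed on a configurable number of threads, with the partial
// results merged on the client.
public class ChunkedCount {
    static long parallelCount(int[] rows, int[] boundaries, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> parts = new ArrayList<>();
        int start = 0;
        for (int b : boundaries) {                  // one task per "region"
            final int lo = start, hi = b;
            parts.add(pool.submit(() -> {
                long n = 0;
                for (int i = lo; i < hi; i++) if (rows[i] >= 0) n++;  // stand-in filter
                return n;
            }));
            start = b;
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get();  // client-side merge
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        int[] rows = new int[1000];                // all zeros -> all pass the filter
        int[] boundaries = {250, 500, 750, 1000};  // hypothetical region end keys
        System.out.println(parallelCount(rows, boundaries, 4)); // 1000
    }
}
```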
19. • OLTP
• Analytic queries
• HBase-specific
• A lightweight solution
• Who else is going to use it?
20. • Query engine + metadata store + JDBC driver
• Database over HDFS (for bulk loads and full-table scan queries)
• HBase APIs (not accessing HFiles directly)
• …what about performance?…
Query: select count(1) from table over 1M and 5M rows. Data is 3 narrow columns. Number of Region Servers: 1 (Virtual Machine, HBase heap: 2GB, Processor: 2 cores @ 3.3GHz Xeon)
21. • Query engine + metadata store + JDBC driver
• DWH over HDFS
• Runs MapReduce jobs to query HBase
• StorageHandler to read HBase
• …what about performance?…
Query: select count(1) from table over 10M and 100M rows. Data is 5 narrow columns. Number of Region Servers: 4 (HBase heap: 10GB, Processor: 6 cores @ 3.3GHz Xeon)
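The StorageHandler mapping mentioned above is declared in Hive DDL; a minimal sketch (table name, column family, and qualifier are hypothetical, the handler class and `hbase.columns.mapping` property are Hive's HBase integration):

```sql
-- Hypothetical Hive table mapped onto an existing HBase table via the
-- HBase storage handler; Hive queries against it run as MapReduce jobs.
CREATE EXTERNAL TABLE hbase_events (rowkey STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:payload")
TBLPROPERTIES ("hbase.table.name" = "events");
```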
22. • Cassandra + Spark as a lightweight solution (replacing HBase + Spark)
• SQL-like language (CQL) + secondary indexes
• …what about the other Hadoop tools?...
23. • Converged data platform: batch + NoSQL + streaming
• MapR-FS: great for throughput and files of every size + single-record updates
• Apache Drill as SQL layer on MapR-FS
• …proprietary solution…
24. • Developed by Cloudera and open source (→ integrated with the Hadoop ecosystem)
• Low-latency random access
• Super-fast columnar storage
• Designed for next-generation hardware (storage based on the IO of solid-state drives + experimental cache implementation)
• …beta version…
"With Kudu, Cloudera promises to solve Hadoop's infamous storage problem"
InfoWorld | Sep 28, 2015
25. Hadoop storage: HDFS vs HBase vs Kudu

HDFS
• Unstructured data
• Deep storage
• SQL+scan use cases

HBase
• Any type column schema
• Gets/puts/micro scans

Kudu
• Highly scalable in-memory database for MPP workloads
• Structured data
• Fixed column schema
• SQL+scan use cases
• Fast writes, fast updates, fast reads, fast everything
26. Conclusions
• One size doesn't fit all the different requirements
• The choice between different open source solutions is driven by the context
• Technology evolves
• So what?
• REQUIREMENTS
• NO LOCK-IN
• PEER REVIEWS