The Presto/Accumulo connector has been in production for over 18 months. It has been successful overall, but tech debt and some early design decisions have caused pain points along the way.
During this session, we'll briefly review the Accumulo connector for Presto as well as the use case that drove the initial implementation. We'll discuss the pain points we have experienced with the connector, and the latest features and changes to the connector to improve query performance, ingestion, and ease of use.
– Speaker –
Adam Shook
Principal Consultant, Datacatessen
Adam Shook is Founder and Principal Consultant at Datacatessen, a boutique big data solutions company specializing in data architecture and engineering. Shook graduated with a B.S. in Computer Science from the University of Maryland Baltimore County (UMBC) and took a job building a new high-performance graphics engine for a game studio. Looking for new challenges, he enrolled in the Computer Science graduate program at UMBC focusing on distributed computing technologies. Shook has worked on developing a wide variety of data applications and analytics deployed on large-scale production data platforms in both the commercial and government spaces. He is involved in developing and instructing graduate and undergraduate courses at UMBC, preparing young minds to work with big data. He spends what little free time he has playing video games and homebrewing.
— More Information —
For more information see http://www.accumulosummit.com/
4. Presto-Accumulo Review
• Open-source and built by Facebook
– MPP OLAP engine with pluggable storage
• ANSI SQL for NoSQL
• Aim is to accelerate relational OLTP use cases by abstracting away common Accumulo design patterns
• Load data using SQL or Java
• Supports predicate pushdown via advanced indexes and metrics
• Queries ranging from milliseconds to seconds
• Available since Presto 0.153
– See https://github.com/bloomberg/presto for the latest features
Datacatessen
5. Presto/Accumulo Workflow
[Diagram: Client, Coordinator, four Workers, Accumulo]
• Coordinator leverages indexes and optimizations to gather Ranges to scan
• Each worker is given a subset of the Ranges to read from Accumulo in parallel via BatchScanners
• Workers pull data from Accumulo, converting it into Presto’s internal object model
• Accumulo’s job is done, Presto takes over to shuffle data as needed and complete the query
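The hand-off of Ranges from the Coordinator to the Workers can be sketched roughly as follows. This is a minimal illustration, assuming a simple round-robin assignment; the function name `assign_ranges` and the tuple representation of a range are hypothetical stand-ins and not the connector's actual API. In the real connector each worker would then read its ranges through an Accumulo BatchScanner.

```python
from typing import List, Tuple

# A range is modelled as a (start_key, end_key) tuple. This is a stand-in
# for an Accumulo Range, not the real class.
KeyRange = Tuple[str, str]


def assign_ranges(ranges: List[KeyRange], num_workers: int) -> List[List[KeyRange]]:
    """Deal the coordinator's ranges out to workers round-robin (an assumed policy)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    buckets: List[List[KeyRange]] = [[] for _ in range(num_workers)]
    for i, key_range in enumerate(ranges):
        # Range i goes to worker i modulo the worker count.
        buckets[i % num_workers].append(key_range)
    return buckets
```

Each inner list is the subset of ranges one worker would scan in parallel with the others; how the connector actually balances ranges across workers is not stated on the slide.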
13. Bottleneck in Index Retrieval
• Three new features/optimizations
– ThreadPool to fetch row IDs in parallel
– Composite indexes
– Distributing the index lookup to Workers
• Lesson learned: more parallelism and more indexes make queries faster
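The first optimization, fetching row IDs in parallel with a thread pool, might look something like the sketch below. It assumes the query's predicates are combined with AND, so the per-column results are intersected; the in-memory `Index` dictionary and the `lookup` and `fetch_row_ids` names are hypothetical stand-ins for scans against the index table.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory index: (column, value) -> row IDs.
Index = Dict[Tuple[str, str], Set[str]]


def lookup(index: Index, column: str, value: str) -> Set[str]:
    """Stand-in for one index scan; returns the row IDs for a column value."""
    return set(index.get((column, value), set()))


def fetch_row_ids(index: Index, predicates: List[Tuple[str, str]], threads: int = 4) -> Set[str]:
    """Run one lookup per predicate in parallel, then intersect the results."""
    if not predicates:
        return set()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves predicate order and waits for every lookup.
        results = list(pool.map(lambda p: lookup(index, p[0], p[1]), predicates))
    # Assumed AND semantics: a row must match every predicate.
    return set.intersection(*results)
```

With real index scans the lookups are I/O-bound, which is the case where a thread pool tends to help most.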
[Diagram: From Coordinator: 'alice' and 'wendy'. To Worker: 'alice' and 'erin'; 'erin' and 'olivia'; 'oscar' and 'wendy']
20. Index Hotspotting
• Indexing 1.0 was a quick win
– Basic reverse index
– Caused all of the problems you want to avoid with indexes
– Low cardinality columns caused very wide rows
– Timestamp columns are monotonically increasing
– Key distribution was all over the place due to different data types in the same table
– Deleting columns requires configuring RegexFilters and compacting/merging the tables
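To make the hotspotting problems concrete, here is a small sketch of a basic reverse index, assuming the cell value is used as the index row key and all columns share one index. The layout, the function name `build_reverse_index`, and the sample data are illustrative assumptions, not the connector's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_reverse_index(rows: Iterable[Tuple[str, Dict[str, str]]]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each cell value to the (column, row ID) pairs that contain it.

    The value is the index row key, so every data row sharing a value
    lands in the same index row.
    """
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for row_id, columns in rows:
        for column, value in columns.items():
            index[value].add((column, row_id))
    return dict(index)


# Hypothetical data: a low-cardinality flag and an ever-increasing timestamp.
rows = [(f"row{i}", {"active": "true", "ts": f"2017-01-01T00:00:{i:02d}"}) for i in range(5)]
index = build_reverse_index(rows)

# Number of entries per index row: a rough measure of row width.
widths = {value: len(entries) for value, entries in index.items()}
```

In this toy data the single key "true" collects an entry for every data row, which is the wide-row problem, while each new timestamp sorts after all earlier keys, so writes keep landing at the end of the key space. Booleans and timestamps also share one sorted key space, which hints at why key distribution becomes uneven.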