SlideShare a Scribd company logo
Building Bulk Import
Eric Newton
SW Complete, Inc.
Ingest 101
The life of a mutation:
○ Send to server
○ Write to Write-Ahead Log
○ Store in memory
○ Write memory to a file
○ Merge, re-write as needed
At least two writes, small in-memory sort
Bulk Ingest
● Accumulo: heavy ingest
● Use Map-Reduce efficiency
● Pre-sort incoming data
● Hand whole sorted files to accumulo
● One write
● Larger sorts
Version 1
importDirectory(String dir, String failDir)
Client computes the servers that need the file:
■ Analysis of file
■ Moves directory under Accumulo
■ Retry logic (to handle splits, failures)
Version 1: problems
● Limited to the client’s computational power
● Permission: client had to be all-knowing
● Files could be added to servers many times
● Defer file collection while bulk importing
● Clients can fail
Version 1.1
● Clients hand bulk imports to the master
● Fixed permission problems
● Added a bulk import test to the Random
Walk test suit
The Test
Create a sorted file with a lot of 1’s:
12345678 -> 1
Create an identical file, with lots of -1’s:
12345678 -> -1
Add a summation iterator over the table
Verify: every entry should be zero:
12345678 -> 0
Random Walk 101
Randomly:
○ Import files in random order
○ Split the files into random sizes
○ Split tablets and random points
○ Kill tablet servers (agitate)
Under load
At scale
Version 2.0
● Master only coordinates file processing
● Distribute work to tablet servers
● Master distributes files to tablet servers
● Tablet serves
○ Analyze files for assignments
○ Retry
○ Communicate with destination tablet servers
FilesFilesFiles
Command Flow
Client Master
Tablet
Server
Tablet
Server
Tablet
Server
Tablet
Server
Problems
Problems solved:
○ Permissions controlled by Master
○ File processing is distributed
Bulk import tested heavily
Consistency
○ Not so much
Version 3.0
Problem: file imported more than once
○ Repeated reloading
○ RPC timeouts
○ Tablet migration
○ Tablet split
Add flags to metadata table to prevent:
○ file garbage collection
○ repeated imports
Version 3.0
● Problem Solved
○ Reduced Name Node ops
○ Reduced trash laying around from failed imports
○ Imports not repeated
● Does it stand up to the Random Walk Test?
○ Not so much
Death by Slow Thread
“Please take this file”
“OK… looks good, never saw this before”
Sleep
“Hey, Please take this file, again”
“OK… looks good, never saw this before”
“Thanks!”
Compact!
Wakeup! Time to import that file from the 1st
request!
Zookeeper to the Rescue
Add a 3rd
party negotiator
● Define a session
● Add a file only while session is active
● Store session in zookeeper
● Get agreement about the session at each critical point,
including metadata table updates
Session guides clean-up of markers
Session closes only after all agree
Session
“Take this file, for session 123”
“OK, working on session 123”
“Never saw this file before”
Sleep
“Repeat, file for session 123”
“Never saw this file before”
“Anybody working on session 123?”
Session
“Yes, I’m processing session 123!”
“OK, finish up.”
Double check on session 123, import file
“Anybody working on session 123?”
“What’s session 123?”
Remove markers in metadata
Bulk Import
● Problem solved:
○ Distributed processing
○ Permissions
○ Files imported once, and only once
○ Markers are cleaned up in the face of failures
● Performance
○ Not so much
Master as Bottleneck
Bulk import thousands of files
Every 15 minutes
Master renames files and puts them under /accumulo
Bottleneck
One master competes for NN ops with N tablet servers
Master, Go Faster
Add configurable thread pool
Push more move requests to the NN
Compete more fiercely for resources
Bulk Import
More efficient data ingest
Easy: just a file to the right tablets
Hard: consistency in a distributed system
Testing is your friend
Nothing prepares you for large-scale problems!

More Related Content

What's hot

Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyOlivier Bourgeois
 
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael MedinOSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael MedinNETWAYS
 
HTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleHTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleKazuho Oku
 
Writing a fast HTTP parser
Writing a fast HTTP parserWriting a fast HTTP parser
Writing a fast HTTP parserfukamachi
 
4 exercises for part 1
4   exercises for part 14   exercises for part 1
4 exercises for part 1drewz lin
 
Gsummit apis-2013
Gsummit apis-2013Gsummit apis-2013
Gsummit apis-2013Gluster.org
 
Lcna tutorial-2012
Lcna tutorial-2012Lcna tutorial-2012
Lcna tutorial-2012Gluster.org
 
Harden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated EvasionsHarden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated EvasionsIxia
 
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Codemotion
 
AV Evasion with the Veil Framework
AV Evasion with the Veil FrameworkAV Evasion with the Veil Framework
AV Evasion with the Veil FrameworkVeilFramework
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataRoberto Franchini
 
Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015fukamachi
 
Multi-core Node.pdf
Multi-core Node.pdfMulti-core Node.pdf
Multi-core Node.pdfAhmed Hassan
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingAndrey Vagin
 
Life as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiLife as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiGluster.org
 

What's hot (20)

Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development Efficiency
 
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael MedinOSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
 
HTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleHTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS module
 
Writing a fast HTTP parser
Writing a fast HTTP parserWriting a fast HTTP parser
Writing a fast HTTP parser
 
4 exercises for part 1
4   exercises for part 14   exercises for part 1
4 exercises for part 1
 
Veil-Ordnance
Veil-OrdnanceVeil-Ordnance
Veil-Ordnance
 
Barcamp presentation
Barcamp presentationBarcamp presentation
Barcamp presentation
 
Gsummit apis-2013
Gsummit apis-2013Gsummit apis-2013
Gsummit apis-2013
 
Lcna tutorial-2012
Lcna tutorial-2012Lcna tutorial-2012
Lcna tutorial-2012
 
Harden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated EvasionsHarden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated Evasions
 
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
 
AV Evasion with the Veil Framework
AV Evasion with the Veil FrameworkAV Evasion with the Veil Framework
AV Evasion with the Veil Framework
 
SELinux by Example
SELinux by ExampleSELinux by Example
SELinux by Example
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big data
 
Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015
 
Multi-core Node.pdf
Multi-core Node.pdfMulti-core Node.pdf
Multi-core Node.pdf
 
rsyslog meets docker
rsyslog meets dockerrsyslog meets docker
rsyslog meets docker
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel Testing
 
Life as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiLife as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan Rossi
 
Linux for Beginners
Linux for  BeginnersLinux for  Beginners
Linux for Beginners
 

Viewers also liked

Accumulo design
Accumulo designAccumulo design
Accumulo designscsorensen
 
Accumulo meetup 20130109
Accumulo meetup 20130109Accumulo meetup 20130109
Accumulo meetup 20130109Sqrrl
 
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit
 
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit
 
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]Accumulo Summit
 
Accumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache AccumuloAccumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache AccumuloAccumulo Summit
 
Apache Accumulo and the Data Lake
Apache Accumulo and the Data LakeApache Accumulo and the Data Lake
Apache Accumulo and the Data LakeAaron Cordova
 
Accumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit
 
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit
 
Large Scale Accumulo Clusters
Large Scale Accumulo ClustersLarge Scale Accumulo Clusters
Large Scale Accumulo ClustersAaron Cordova
 
Accumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick IntroductionJames Salter
 
Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411Sqrrl
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit
 
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big DataOct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big DataYahoo Developer Network
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit
 
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...Accumulo Summit
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo OverviewBill Havanki
 
Apache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosApache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosJoe Stein
 

Viewers also liked (20)

Accumulo design
Accumulo designAccumulo design
Accumulo design
 
Accumulo meetup 20130109
Accumulo meetup 20130109Accumulo meetup 20130109
Accumulo meetup 20130109
 
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
 
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
 
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
 
Accumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache AccumuloAccumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache Accumulo
 
Apache Accumulo and the Data Lake
Apache Accumulo and the Data LakeApache Accumulo and the Data Lake
Apache Accumulo and the Data Lake
 
Accumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the Enterprise
 
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
 
Large Scale Accumulo Clusters
Large Scale Accumulo ClustersLarge Scale Accumulo Clusters
Large Scale Accumulo Clusters
 
Accumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick Introduction
 
Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
 
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big DataOct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
 
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo Overview
 
Apache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosApache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on Mesos
 

Similar to Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]

Accumulo14 15
Accumulo14 15Accumulo14 15
Accumulo14 15Sqrrl
 
Lcna 2012-example
Lcna 2012-exampleLcna 2012-example
Lcna 2012-exampleGluster.org
 
Accumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and RoadmapAccumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and RoadmapAaron Cordova
 
Stashaway 1
Stashaway 1Stashaway 1
Stashaway 1priestc
 
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bagBrown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bagCask Data
 
Bacula Overview
Bacula OverviewBacula Overview
Bacula Overviewsambismo
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Linux Distribution Automated Testing
 Linux Distribution Automated Testing Linux Distribution Automated Testing
Linux Distribution Automated TestingAleksander Baranowski
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django applicationbangaloredjangousergroup
 
Toolchain Independent Distributed Compilation
Toolchain Independent Distributed CompilationToolchain Independent Distributed Compilation
Toolchain Independent Distributed CompilationDietmar Hauser
 
Source andassetcontrolingamedev
Source andassetcontrolingamedevSource andassetcontrolingamedev
Source andassetcontrolingamedevMatt Benic
 
Introduction to containers
Introduction to containersIntroduction to containers
Introduction to containersNitish Jadia
 
Remote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profitRemote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profitDharmalingam Ganesan
 
Technical track-afterimaging Progress Database
Technical track-afterimaging Progress DatabaseTechnical track-afterimaging Progress Database
Technical track-afterimaging Progress DatabaseVinh Nguyen
 
Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2) Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2) Ahmed El-Arabawy
 
Utopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K usersUtopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K usersJaime Buelta
 
Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+Python Ireland
 
Google File System
Google File SystemGoogle File System
Google File SystemDreamJobs1
 

Similar to Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored] (20)

Accumulo14 15
Accumulo14 15Accumulo14 15
Accumulo14 15
 
Lcna 2012-example
Lcna 2012-exampleLcna 2012-example
Lcna 2012-example
 
Accumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and RoadmapAccumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and Roadmap
 
Stashaway 1
Stashaway 1Stashaway 1
Stashaway 1
 
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bagBrown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
 
Bacula Overview
Bacula OverviewBacula Overview
Bacula Overview
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Linux Distribution Automated Testing
 Linux Distribution Automated Testing Linux Distribution Automated Testing
Linux Distribution Automated Testing
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
Toolchain Independent Distributed Compilation
Toolchain Independent Distributed CompilationToolchain Independent Distributed Compilation
Toolchain Independent Distributed Compilation
 
Source andassetcontrolingamedev
Source andassetcontrolingamedevSource andassetcontrolingamedev
Source andassetcontrolingamedev
 
Introduction to containers
Introduction to containersIntroduction to containers
Introduction to containers
 
SNIA SDC 2016 final
SNIA SDC 2016 finalSNIA SDC 2016 final
SNIA SDC 2016 final
 
Remote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profitRemote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profit
 
CSA-lecture 6.pptx
CSA-lecture 6.pptxCSA-lecture 6.pptx
CSA-lecture 6.pptx
 
Technical track-afterimaging Progress Database
Technical track-afterimaging Progress DatabaseTechnical track-afterimaging Progress Database
Technical track-afterimaging Progress Database
 
Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2) Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2)
 
Utopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K usersUtopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K users
 
Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+
 
Google File System
Google File SystemGoogle File System
Google File System
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀DianaGray10
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backElena Simperl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...Product School
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...CzechDreamin
 

Recently uploaded (20)

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 

Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]

  • 1. Building Bulk Import Eric Newton SW Complete, Inc.
  • 2. Ingest 101 The life of a mutation: ○ Send to server ○ Write to Write-Ahead Log ○ Store in memory ○ Write memory to a file ○ Merge, re-write as needed At least two writes, small in-memory sort
  • 3. Bulk Ingest ● Accumulo: heavy ingest ● Use Map-Reduce efficiency ● Pre-sort incoming data ● Hand whole sorted files to accumulo ● One write ● Larger sorts
  • 4. Version 1 importDirectory(String dir, String failDir) Client computes the servers that need the file: ■ Analysis of file ■ Moves directory under Accumulo ■ Retry logic (to handle splits, failures)
  • 5. Version 1: problems ● Limited to the client’s computational power ● Permission: client had to be all-knowing ● Files could be added to servers many times ● Defer file collection while bulk importing ● Clients can fail
  • 6. Version 1.1 ● Clients hand bulk imports to the master ● Fixed permission problems ● Added a bulk import test to the Random Walk test suit
  • 7. The Test Create a sorted file with a lot of 1’s: 12345678 -> 1 Create an identical file, with lots of -1’s: 12345678 -> -1 Add a summation iterator over the table Verify: every entry should be zero: 12345678 -> 0
  • 8. Random Walk 101 Randomly: ○ Import files in random order ○ Split the files into random sizes ○ Split tablets and random points ○ Kill tablet servers (agitate) Under load At scale
  • 9. Version 2.0 ● Master only coordinates file processing ● Distribute work to tablet servers ● Master distributes files to tablet servers ● Tablet serves ○ Analyze files for assignments ○ Retry ○ Communicate with destination tablet servers
  • 11. Problems Problems solved: ○ Permissions controlled by Master ○ File processing is distributed Bulk import tested heavily Consistency ○ Not so much
  • 12. Version 3.0 Problem: file imported more than once ○ Repeated reloading ○ RPC timeouts ○ Tablet migration ○ Tablet split Add flags to metadata table to prevent: ○ file garbage collection ○ repeated imports
  • 13. Version 3.0 ● Problem Solved ○ Reduced Name Node ops ○ Reduced trash laying around from failed imports ○ Imports not repeated ● Does it stand up to the Random Walk Test? ○ Not so much
  • 14. Death by Slow Thread “Please take this file” “OK… looks good, never saw this before” Sleep “Hey, Please take this file, again” “OK… looks good, never saw this before” “Thanks!” Compact! Wakeup! Time to import that file from the 1st request!
  • 15. Zookeeper to the Rescue Add a 3rd party negotiator ● Define a session ● Add a file only while session is active ● Store session in zookeeper ● Get agreement about the session at each critical point, including metadata table updates Session guides clean-up of markers Session closes only after all agree
  • 16. Session “Take this file, for session 123” “OK, working on session 123” “Never saw this file before” Sleep “Repeat, file for session 123” “Never saw this file before” “Anybody working on session 123?”
  • 17. Session “Yes, I’m processing session 123!” “OK, finish up.” Double check on session 123, import file “Anybody working on session 123?” “What’s session 123?” Remove markers in metadata
  • 18. Bulk Import ● Problem solved: ○ Distributed processing ○ Permissions ○ Files imported once, and only once ○ Markers are cleaned up in the face of failures ● Performance ○ Not so much
  • 19. Master as Bottleneck Bulk import thousands of files Every 15 minutes Master renames files and puts them under /accumulo Bottleneck One master competes for NN ops with N tablet servers
  • 20. Master, Go Faster Add configurable thread pool Push more move requests to the NN Compete more fiercely for resources
  • 21. Bulk Import More efficient data ingest Easy: just a file to the right tablets Hard: consistency in a distributed system Testing is your friend Nothing prepares you for large-scale problems!