This (double length) hands-on session is intended to provide a high level introduction for those new to Apache Accumulo. We'll cover the following:
Accumulo architecture
How data is stored
The Accumulo shell
Accumulo's authorizations and cell-level security
Basic reading and writing of data
Pentaho Data Integration. Preparing and blending data from any source for analytics. Thus, enabling data-driven decision making. Application for education, specially, academic and learning analytics.
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
Open source is at the heart of what we do at Grafana Labs and there is so much happening! The intent of this talk to update everyone on the latest development when it comes to Grafana, Pyroscope, Faro, Loki, Mimir, Tempo and more. Everyone has had at least heard about Grafana but maybe some of the other projects mentioned above are new to you? Welcome to this talk 😉 Beside the update what is new we will also quickly introduce them during this talk.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix use the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up & keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists, etc. Filmed at qconsf.com..
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
The monolith to cloud-native, microservices evolution has driven a shift from monitoring to observability. OpenTelemetry, a merger of the OpenTracing and OpenCensus projects, is enabling Observability 2.0. This talk gives an overview of the OpenTelemetry project and then outlines some production-proven architectures for improving the observability of your applications and systems.
Prometheus: What is is, what is new, what is comingJulien Pivotto
Prometheus is a metrics-based monitoring and alerting system and also the project with the second longest tenure within the CNCF. As such you have probably heard about it by now. We will give you a short introduction to Prometheus, what it is and why it was such a big deal when it was initially released. In all those years since then, the project has only gained speed, which provides us with the opportunity to tell you about all the exciting new features that have just been released or are in the pipeline, including opportunities to contribute to the project and its wider ecosystem.
Talk at kubecon 2021
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain
Understanding variable roles in code has been found to be helpful by students
in learning programming -- could variable roles help deep neural models in
performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
More Related Content
Similar to Accumulo Tutorial — Up and Running (or at Least Walking) in 90 Minutes
Pentaho Data Integration. Preparing and blending data from any source for analytics. Thus, enabling data-driven decision making. Application for education, specially, academic and learning analytics.
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
Open source is at the heart of what we do at Grafana Labs and there is so much happening! The intent of this talk to update everyone on the latest development when it comes to Grafana, Pyroscope, Faro, Loki, Mimir, Tempo and more. Everyone has had at least heard about Grafana but maybe some of the other projects mentioned above are new to you? Welcome to this talk 😉 Beside the update what is new we will also quickly introduce them during this talk.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix use the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up & keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists, etc. Filmed at qconsf.com..
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
The monolith to cloud-native, microservices evolution has driven a shift from monitoring to observability. OpenTelemetry, a merger of the OpenTracing and OpenCensus projects, is enabling Observability 2.0. This talk gives an overview of the OpenTelemetry project and then outlines some production-proven architectures for improving the observability of your applications and systems.
Prometheus: What is is, what is new, what is comingJulien Pivotto
Prometheus is a metrics-based monitoring and alerting system and also the project with the second longest tenure within the CNCF. As such you have probably heard about it by now. We will give you a short introduction to Prometheus, what it is and why it was such a big deal when it was initially released. In all those years since then, the project has only gained speed, which provides us with the opportunity to tell you about all the exciting new features that have just been released or are in the pipeline, including opportunities to contribute to the project and its wider ecosystem.
Talk at kubecon 2021
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain
Understanding variable roles in code has been found to be helpful by students
in learning programming -- could variable roles help deep neural models in
performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
3. people. technology. integrity
Welcome!
● Accumulo is a sparse, distributed, sorted, multi-dimensional map
● Modeled after Google’s BigTable design, but with security features
built in
● Designed to scale linearly to trillions of records and 10s of
petabytes
● Features automatic load balancing, high-availability, fault-tolerance,
and dynamic data model
What is Accumulo?
15 October 2018 3
4. people. technology. integrity
Welcome!
● Data model features cell level security
● Stands up to sustained, high-volume ingest
● Data lifecycle management
● Powerful server-side programming features for data manipulation
(iterators)
Why Accumulo?
15 October 2018 4
7. people. technology. integrity
Welcome!
● Stores data as key/value pairs, sorted in lexicographically
ascending order
● All elements are represented as byte arrays, except...
● Timestamp is a long, sorted in descending order
● Tables consist of a set of sorted key/value pairs
Data Model
15 October 2018 7
Key
Row ID
Column
Family Qualifier Visibility
Timestamp
Value
8. people. technology. integrity
Data Layout
15 October 2018 8
Row
ID
Column
Family
Column
Qualifier
Column
Visibility
Timestamp Value
bob attribute dateOfBirth public 1539193308999 May 9th, 1985
bob attribute height public 1539193309131 71”
bob insurance dental private 1539193309285 MetLife
jane attribute bloodType public 1539193309362 ab-
jane attribute surname public 1539193309512 doe
jane contact cellPhone public 1539193309622 (808) 345-9876
jane insurance vision private 1539193309635 VSP
john allergy major private 1539193309650 amoxicillin
john attribute weight public 1539193309673 180
john contact homeAddr public 1539193309683 34 Baker LN
9. people. technology. integrity
Hierarchical Decomposition
15 October 2018 9
Row ID bob jane
Column
Family
attribute insurance attribute contact insurance
Column
Qualifier
dateOfBirth height dental bloodType surname cellPhone vision
Value May 9, 1985 5’11” MetLife ab- doe
(808)
345-9876
VSP
11. people. technology. integrity
Welcome!
Shell Commands
15 October 2018 11
Command Description
help Provide information about the available commands
tables Display a list of all existing tables
table Switch to the specified table
createtable Creates a new table with the specified name
deletetable Deletes the table with the specified name
insert Inserts a record
scan Scans the table and displays the resulting records
delete Deletes a record from the table
getauths Displays the maximum scan authorizations for a user
setauths sets the maximum scan authorizations for a user
quit Exits the shell
13. people. technology. integrity
Welcome!
● Get users, tables
● Create some tables
● Select the table you want
● accumulo.* tables are managed and used internally by Accumulo,
not to be altered directly by users
Users/Tables
15 October 2018 13
14. people. technology. integrity
Welcome!
● Insert records into the current table
● See file containing insert statements on VM desktop, copy & paste
● Usage: insert row column_family column_qualifier value -l
column_visibility
Inserting Records
15 October 2018 14
15. people. technology. integrity
Welcome!
Column Visibility Syntax
15 October 2018 15
Label Description
A & B Both ‘A’ and ‘B’ are required
A | B Either ‘A’ or ‘B’ is required
A & (C | B) ‘A’ and ‘C’ or ‘A’ and ‘B’ is required
A | (B & C) ‘A’ or ‘B’ and ‘C’ is required
● Combinations of visibility labels can be set on records by
passing them into an insert with logical operators
16. people. technology. integrity
Welcome!
● Try a scan
● Nothing comes up: no authorizations have been set on the current
user
● Set authorizations on root user
● User’s auths are returned in sorted order
Scans/Authorizations
15 October 2018 16
17. people. technology. integrity
Welcome!
● Scan with authorizations set
● All inserted records show up
● displayed as: row_id col_family:col_qualifier [visibility] value
Scans/Authorizations
15 October 2018 17
18. people. technology. integrity
Welcome!
● Limit which of a user’s authorizations are used for a scan
● Scanning with different authorizations returns different result sets
Scans/Authorizations
15 October 2018 18
21. people. technology. integrity
● Available at http://localhost:9995
● Interface for monitoring the health and status of Accumulo components
● Displays various metrics relating to function of master server, tablet
server, and tables
Accumulo Monitor Web Interface
15 October 2018 21
23. people. technology. integrity
Welcome!
● Written in Java and uses the
Hadoop Distributed File System
(HDFS) and Apache ZooKeeper
to store table data and
configuration data
● Supports fault tolerance through
replication of tables
● Supports high availability
through automatic failover
Architecture
15 October 2018 23
26. people. technology. integrity
Welcome!
Minor Compaction
● Once the MemTable reaches a
certain size the tablet server
writes out the sorted key/value
pairs to a file in HDFS called an
RFile
● Each Minor Compaction creates
one new RFile
● After the RFile is written to HDFS,
a new MemTable is then created
and the compaction is recorded in
the Write-Ahead Log
Tablet Writes
15 October 2018 26
27. people. technology. integrity
Welcome!
Major Compaction
● Periodically, the tablet server
will combine a set of RFiles to
minimize the number of RFiles
tracked per tablet
● During the merge, any
key/value pairs suppressed by
a delete entry are omitted
● After a Major Compaction
occurs the previous RFiles are
set to be cleaned up by the
Garbage Collector
Tablet Writes
15 October 2018 27
29. people. technology. integrity
Welcome!
● All operations on accumulo need a reference to the Accumulo
Connector
● Provides a way to perform actions as a user
● Connect to a running instance of ZooKeeper
Connector
15 October 2018 29
30. people. technology. integrity
Welcome!
● A mutation holds all changes to a table associated with a particular row
ID
● Add desired writes to a mutation
Mutation
15 October 2018 30
31. people. technology. integrity
Welcome!
● BatchWriter takes mutations and writes to accumulo via connector
● Typically add multiple mutations to a batchwriter at a time
○ Can add single mutation using writer.addMutation(m)
● Batches together data to amortize network communication overhead
BatchWriter
15 October 2018 31
32. people. technology. integrity
Welcome!
● Use a Scanner to read from a table
● Returns an iterable of results that match the scan
● BatchScanner will scan multiple ranges in parallel: faster for data
spread out on many servers, but doesn’t guarantee ordering of results
Scanner
15 October 2018 32
34. people. technology. integrity
Welcome!
● VM Desktop → Exercises/exercise-1
○ Your choice of IDE
● Tasks:
○ Add a Mutation for each piece of information we want to
store from a tweet
○ Add mutations to BatchWriter
Accumulo Writer
15 October 2018 34
36. people. technology. integrity
Welcome!
● VM Desktop → Exercises/exercise-2
● Tasks:
○ Configure a ZooKeeperInstance and Connector
○ Check to make sure TwitterData table exists
○ Create and configure a Scanner
○ Iterate over the Scanner and display each tweet’s ID and text
Accumulo Reader
15 October 2018 36
39. people. technology. integrity
Welcome!
● Highly consistent
○ Tablets are assigned to exactly one tablet server at a time which
means updates are immediately reflected in reads
● Horizontally Scalable
○ Machines added to the cluster begin participating almost
immediately
● High Failure Tolerance
○ If a tablet server fails, the master just assigns the tablets to a new
node
Additional Features
15 October 2018 39
40. people. technology. integrity
Welcome!
● MapReduce
○ First class support for MapReduce
● Iterators
○ Powerful server-side programming construct
● Data Lifecycle Management
○ Advanced data lifecycle management for compliance with
legal/policy requirements
Additional Features
15 October 2018 40
41. people. technology. integrity
Welcome!
● Tweet Indexer
○ MapReduce program to create an inverted index using Twitter
table from exercise 1
○ /home/clearedge/Desktop/Exercises/exercise-3
● Tweet Searcher
○ Find all tweets containing words passed in as arguments using
inverted index from exercise 3
○ /home/clearedge/Desktop/Exercises/exercise-4
Take Home Exercises
15 October 2018 41
42. people. technology. integrity
Related Projects
15 October 2018 42
● Datawave – an ingest/query framework that leverages Accumulo to
provide fast, secure data access
○ https://code.nsa.gov/datawave/
● Nifi – automated data flow between systems with custom processing
capabilities
○ https://nifi.apache.org/
● Fluo – large scale incremental processing built atop Accumulo
○ https://fluo.apache.org
● Timely – time series database application that provides secure access to
time series data
○ https://code.nsa.gov/timely/