LocationTech GeoMesa is a project that builds on open-source, distributed databases like Accumulo, HBase, and Cassandra to scale up indexing, querying, and analyzing billions of spatio-temporal data points. GeoMesa uses space-filling curves to index multi-dimensional data in Accumulo, and we'll discuss recent improvements for non-point geometries. Over the two and a half years GeoMesa has been an open-source project, GeoMesa's Accumulo schemas have evolved and our team has had a chance to work through creating and optimizing custom Accumulo iterators. These custom iterators allow for better query performance and interesting aggregations. GeoMesa provides support for distributed processing in Spark via MapReduce input and output formats that extend their Accumulo counterparts. We will discuss the performance benefit gained by reducing the number of default map/Spark tasks created for complex query patterns. The talk will conclude with updates about GeoMesa's integration with Jupyter notebook and improvements to GeoMesa's Spark integration.
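As a rough illustration of the space-filling-curve idea, the sketch below interleaves the bits of two discretized coordinates into a single Z-order index, so that points close in space end up with nearby indices and therefore nearby Accumulo row keys. This is a simplification with invented function names: GeoMesa's actual schemas also encode time, use configurable precision, and need extra handling for non-point geometries.

```python
def interleave_bits(dims, bits_per_dim=8):
    """Interleave the bits of several discretized dimensions into one
    Z-order (Morton) index -- a simplified sketch of the space-filling
    curves GeoMesa uses to linearize multi-dimensional data."""
    z = 0
    for bit in range(bits_per_dim - 1, -1, -1):   # most-significant bit first
        for d in dims:
            z = (z << 1) | ((d >> bit) & 1)
    return z

def discretize(value, lo, hi, bits=8):
    """Map a coordinate onto an integer cell index of a 2**bits grid."""
    cells = (1 << bits) - 1
    return round((value - lo) / (hi - lo) * cells)

# Nearby points get nearby indices, hence nearby sorted row keys.
x = discretize(-78.48, -180, 180)   # longitude (Charlottesville, VA)
y = discretize(38.03, -90, 90)      # latitude
key = interleave_bits([x, y])
```

A range query then becomes a small set of scans over contiguous key ranges, which is what makes the layout friendly to a sorted key/value store.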
– Speaker –
Dr. James Hughes
Mathematician, Commonwealth Computer Research, Inc (CCRi)
Dr. James Hughes is a mathematician at Commonwealth Computer Research, Inc. in Charlottesville, Virginia. He is a core committer for GeoMesa which leverages Accumulo and other distributed database systems to provide distributed computation and query engines. He is a LocationTech committer for GeoMesa, SFCurve, and GeoBench. He serves on the LocationTech Project Management Committee and Steering Committee. Through work with LocationTech and OSGeo projects like GeoTools and GeoServer, he works to build end-to-end solutions for big spatio-temporal problems. He holds a PhD in algebraic topology from the University of Virginia.
— More Information —
For more information see http://www.accumulosummit.com/
Accumulo Collections is a lightweight library that dramatically simplifies development of fast NoSQL applications by encapsulating many powerful, distributed features of Accumulo in the familiar Java Collections interface. Accumulo is a giant sorted map with rich server-side functionality, and our AccumuloSortedMap is a robust java SortedMap implementation that is backed by an Accumulo table. It handles serialization and foreign keys, and provides extensive server-side features like entry timeout, aggregates, filtering, efficient one-to-many mapping, partitioning and sampling. Users can define custom server-side transformations and aggregates with Accumulo iterators.
More information on this project can be found on github at: https://github.com/isentropy/accumulo-collections/wiki
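To make the entry-timeout feature concrete, here is a toy in-memory analogue. The class and method names are invented, not the library's API: in Accumulo Collections the expiry check would run server-side in an iterator during scans, while this sketch simply filters on read.

```python
import time

class TimeoutSortedMap:
    """Toy in-memory analogue of a sorted map with per-entry timeout.
    Illustrative only -- not the Accumulo Collections API."""
    def __init__(self, ttl_seconds=None, clock=time.time):
        self._data = {}          # key -> (value, insert_time)
        self._ttl = ttl_seconds
        self._clock = clock      # injectable for testing

    def put(self, key, value):
        self._data[key] = (value, self._clock())

    def _live(self, key):
        _, ts = self._data[key]
        return self._ttl is None or self._clock() - ts < self._ttl

    def get(self, key, default=None):
        if key in self._data and self._live(key):
            return self._data[key][0]
        return default

    def items(self):
        """Iterate live entries in key order, like a scan of a sorted table."""
        for key in sorted(self._data):
            if self._live(key):
                yield key, self._data[key][0]
```

The point of doing this server-side, as the library does with iterators, is that expired entries are filtered where the data lives instead of being shipped to the client first.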
– Speaker –
Jonathan Wolff
Founder, Director of Engineering, Isentropy LLC
Jonathan is a former physicist who operates a consultancy specializing in big data and data science project work. He worked for Bloomberg last year and built their Accumulo File System, which was presented as the keynote of the 2015 Accumulo Summit. He has also done distributed-computing project work for Yahoo! in Pig.
Jonathan holds a BA in Physics (Harvard, magna cum laude 2001) and an MS in Mechanical Engineering (Columbia, 2003), and has been avidly programming since the 1980s.
— More Information —
For more information see http://www.accumulosummit.com/
LocationTech is an Eclipse Foundation industry working group for location-aware technologies. This presentation introduces LocationTech and looks at what it means for our industry and the participating projects.
Libraries: JTS Topology Suite is the rocket science of GIS, providing an implementation of Geometry. Mobile Map Tools provides a C++ foundation that is translated into Java and JavaScript for maps on iOS, Android, and WebGL. GeoMesa is a distributed spatio-temporal datastore built on Accumulo. Spatial4j integrates with JTS to provide geometry on a curved surface.
Process: GeoTrellis offers real-time distributed processing using Scala, Akka, and Spark. GeoJinni mixes spatial data and indexing with Hadoop.
Applications: GEOFF offers OpenLayers 3 as an SWT component. GeoGit offers distributed revision control for feature data. GeoScript brings spatial data to Groovy, JavaScript, Python, and Scala. uDig offers an Eclipse-based desktop GIS solution.
Attend this presentation if you want to know what LocationTech is about, are interested in these projects, or are curious about what projects will be next.
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec... — Jen Aman
Kyle Foreman presented on using Spark for large-scale global health simulations. The talk discussed (1) the motivation for simulations to model disease burden forecasts and alternative scenarios, (2) SimBuilder for constructing modular simulation workflows as directed acyclic graphs, and (3) benchmarks showing Spark backends can efficiently distribute simulations across a cluster. Future work aims to optimize Spark DataFrame joins and take better advantage of Numpy's vectorization for panel data simulations.
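The fan-out pattern behind such simulations can be sketched without a cluster: each random seed becomes one independent task, and the driver aggregates the draws. The toy trajectory model and every parameter below are invented for illustration; with a Spark backend the loop in `run_draws` would be distributed, roughly `sc.parallelize(seeds).map(simulate).collect()`.

```python
import random
import statistics

def simulate(seed, years=10, baseline=100.0, trend=-0.02, noise=0.05):
    """One Monte Carlo draw of a toy disease-burden trajectory."""
    rng = random.Random(seed)
    burden = baseline
    for _ in range(years):
        burden *= 1 + trend + rng.gauss(0, noise)
    return burden

def run_draws(seeds):
    """Single-process stand-in for the distributed map over seeds."""
    return [simulate(s) for s in seeds]

draws = run_draws(range(1000))
print(statistics.mean(draws))    # central forecast
print(statistics.stdev(draws))   # spread across scenarios
```

Because draws are independent, the work is embarrassingly parallel, which is why the talk's benchmarks focus on scheduling overhead rather than algorithmic coupling.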
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache... — Accumulo Summit
D4M is a software tool that connects scientists with big data technologies like Apache Accumulo. The D4M-Accumulo binding provides high performance connectivity to Accumulo for quick analytic prototyping. Current research looks to implement GraphBLAS server-side iterators and operators on Accumulo tables to support high performance graph analytics.
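The associative-array style that D4M builds on can be illustrated with a sparse matrix product over dictionaries, the same shape of operation a GraphBLAS iterator would push to the server side. This is a plain-Python sketch, not D4M's actual syntax:

```python
def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as {(row, col): value} dicts --
    the kind of associative-array operation D4M expresses over tables."""
    out = {}
    b_rows = {}                      # index b by row for fast lookup
    for (i, j), v in b.items():
        b_rows.setdefault(i, []).append((j, v))
    for (i, k), v in a.items():
        for j, w in b_rows.get(k, ()):
            out[(i, j)] = out.get((i, j), 0) + v * w
    return out

# Squaring an adjacency matrix counts 2-hop paths between vertices.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1}
two_hop = sparse_matmul(edges, edges)   # {("a", "c"): 1}
```

Expressing graph analytics as sparse linear algebra is exactly what makes a GraphBLAS-style server-side operator attractive: the heavy multiply runs where the table data lives.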
Hopsworks - ExtremeEarth Open Workshop — ExtremeEarth
This document summarizes a presentation about the three-year ExtremeEarth project. It discusses the ExtremeEarth platform architecture, which brings together Earth observation data access from DIASes, end-user products from TEPs, and scalable AI capabilities from Hopsworks. The architecture provides infrastructure on Creodias and uses Hopsworks to develop end-to-end machine learning pipelines for processing petabytes of Earth observation data. Results have been exploited through additional research projects and a product offering on Hopsworks.ai. The project has also led to several publications and blog posts about applying AI to Earth observation data.
A time energy performance analysis of map reduce on heterogeneous systems wit... — newmooxx
This paper presents a time-energy performance analysis of MapReduce workloads on heterogeneous systems with GPUs. The authors evaluate three MapReduce applications on a Hadoop-CUDA framework using a novel lazy processing technique that requires no modifications to the underlying Hadoop framework. Their results show that heterogeneous systems with GPUs can achieve similar execution times as traditional CPU-only clusters while realizing energy savings of up to two-thirds. This finding indicates that heterogeneous systems with integrated GPUs have potential for improving the energy efficiency of big data analytics.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters — Xiao Qin
An increasing number of popular applications are becoming data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data-locality assumptions are not satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy consistently improves MapReduce performance by rebalancing data across nodes before a data-intensive application runs in a heterogeneous Hadoop cluster.
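The core placement idea can be sketched in a few lines: give each node a share of file blocks proportional to its measured processing speed. This is a deliberate simplification of HDFS-HC, which also has to respect replication, rack awareness, and incremental rebalancing; the function and names below are illustrative only.

```python
def place_blocks(num_blocks, node_speeds):
    """Assign blocks to nodes in proportion to measured processing speed --
    the core idea of capacity-aware placement, heavily simplified."""
    total = sum(node_speeds.values())
    shares = {n: num_blocks * s / total for n, s in node_speeds.items()}
    placement = {n: int(share) for n, share in shares.items()}
    # Hand out blocks lost to rounding, largest fractional remainder first.
    leftover = num_blocks - sum(placement.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - int(shares[n]),
                          reverse=True)
    for n in by_remainder[:leftover]:
        placement[n] += 1
    return placement

# A node twice as fast receives twice the data, so both finish together.
print(place_blocks(90, {"fast": 2.0, "slow": 1.0}))
```

The payoff is that map tasks stay data-local on every node: the slow node simply has less local data to chew through, instead of stalling the job or forcing remote reads.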
The document proposes a system called Twiche that uses caching to improve the efficiency of incremental MapReduce jobs. Twiche indexes cached items from the map phase by their original input and applied operations. This allows it to identify duplicate computations and avoid reprocessing the same data. The experimental results show that Twiche can eliminate all duplicate tasks in incremental MapReduce jobs, reducing execution time and CPU utilization compared to traditional MapReduce.
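The caching scheme can be sketched as a store keyed by a digest of the input split plus the operations applied to it, so an identical re-run becomes a cache hit. The class structure and names here are invented for illustration, not Twiche's implementation:

```python
import hashlib
import json

class MapPhaseCache:
    """Toy sketch of caching map-phase output, keyed by input identity
    plus the applied operations, so incremental jobs skip duplicate work."""
    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(input_id, operations):
        return hashlib.sha256(
            json.dumps([input_id, operations]).encode()).hexdigest()

    def run_map(self, input_id, operations, map_fn, records):
        key = self._key(input_id, operations)
        if key in self._store:
            self.hits += 1           # duplicate computation avoided
            return self._store[key]
        result = [map_fn(r) for r in records]
        self._store[key] = result
        return result

cache = MapPhaseCache()
cache.run_map("split-0", ["lower"], str.lower, ["A", "B"])
cache.run_map("split-0", ["lower"], str.lower, ["A", "B"])  # cache hit
assert cache.hits == 1
```

Indexing by both the input and the operations matters: the same split processed by a different map function must not collide with the cached result.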
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ... — Databricks
Apache Spark 2.2 shipped with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL count, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are inherent in many real-world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histograms helps Spark make better decisions in picking the optimal query plan for real-world scenarios.
In this talk, we'll take a deep dive into how Spark's Cost-Based Optimizer estimates the cardinality and size of each database operator. Specifically, for a skewed-distribution workload such as TPC-DS, we will show the histogram's impact on query-plan changes and the resulting performance gains.
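A minimal sketch of an equal-height (equi-depth) histogram and the range-predicate estimate it enables follows. It is simplified relative to Spark's catalog statistics, which also track distinct counts and nulls per bucket; the function names are illustrative.

```python
def equal_height_histogram(values, num_buckets):
    """Build an equal-height histogram: each bucket holds roughly the
    same number of rows, so heavily skewed values get narrow buckets."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [ordered[min(i * n // num_buckets, n - 1)]
              for i in range(num_buckets)]
    bounds.append(ordered[-1])
    return bounds, n / num_buckets   # boundaries, rows per bucket

def estimate_le(bounds, rows_per_bucket, x):
    """Estimate cardinality of `col <= x`: count full buckets below x
    and linearly interpolate inside the straddling bucket."""
    rows = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if x >= hi:
            rows += rows_per_bucket
        elif x > lo:
            rows += rows_per_bucket * (x - lo) / (hi - lo)
    return rows
```

Equal-height buckets are what make this robust under skew: a single hot value cannot dominate a wide bucket, so the interpolation error stays bounded per bucket.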
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod... — Xiao Qin
Hadoop and the term 'Big Data' go hand in hand. The information explosion caused by cloud and distributed computing has led to the curiosity to process and analyze massive amounts of data, and that processing and analysis helps an organization add value or derive valuable information.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Hadoop relies on its capability to take computation to the nodes rather than migrating data around the nodes, which might cause a significant network overhead. This strategy has potential benefits in a homogeneous environment, but it might not be suitable in a heterogeneous one: the time taken to process data on a slower node might be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is necessary to study a data placement policy that distributes data based on the processing power of each node. The project explores this data placement policy and notes the ramifications of the strategy by running a few benchmark applications.
This document provides an overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop was developed based on Google's MapReduce algorithm and how it uses HDFS for scalable storage and MapReduce as an execution engine. Key components of Hadoop architecture include HDFS for fault-tolerant storage across data nodes and the MapReduce programming model for parallel processing of data blocks. The document also gives examples of how MapReduce works and industries that use Hadoop for big data applications.
Dache is a data-aware cache system for big-data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework with a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
GeoMesa is an open-source project that provides scalable geospatial analytics on large datasets. It allows querying and analyzing data stored in Apache Accumulo using a geospatial index. GeoMesa implements the GeoTools API and supports point, line, polygon, raster, and time-enabled data through flexible space-filling curves. It enables distributed computation and analytics through features like multi-step query planning, secondary indexes, and integration with frameworks like Spark and streaming APIs. The project is developed and supported by a community including LocationTech.
This presentation will give you information about:
1. Configuring HDFS
2. Interacting with HDFS
3. HDFS Permissions and Security
4. Additional HDFS Tasks
5. HDFS Overview and Architecture
6. HDFS Installation
7. Hadoop File System Shell
8. File System Java API
Dache: A Data Aware Caching for Big-Data using Map Reduce framework — Safir Shah
This document proposes Dache, a data-aware caching system for big data applications using the MapReduce framework. It aims to extend MapReduce by provisioning a cache layer to efficiently identify and access cached items. The proposed system identifies input sources and operations applied to cache items for proper indexing. It describes cache requests and replies for the map and reduce phases. Experimental results show the proposed system eliminates duplicate tasks, reduces execution time and CPU utilization compared to traditional MapReduce.
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015 — Deanna Kosaraju
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run in the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately relevant to MapReduce-based applications. Provisioning a MapReduce job entails requesting an optimal number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters, chosen from the job profile, such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
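The provisioning trade-off can be sketched as a small search: try each resource-set count, keep only those that meet the deadline, and pick the cheapest. The linear scaling assumption and the cost model below are stand-ins invented for illustration, not the framework proposed in the talk.

```python
def provision(job, deadline_hours, max_rs, cost_per_rs_hour=1.0):
    """Brute-force search for the cheapest number of resource sets (RS)
    that still meets the deadline, under a toy linear-scaling model."""
    best = None
    for rs in range(1, max_rs + 1):
        # Bottleneck work divides across RS; per-job overhead does not.
        runtime = job["serial_hours"] / rs + job["overhead_hours"]
        if runtime > deadline_hours:
            continue
        cost = rs * runtime * cost_per_rs_hour
        if best is None or cost < best[2]:
            best = (rs, runtime, cost)
    return best  # (resource sets, hours, cost), or None if infeasible

job = {"serial_hours": 100.0, "overhead_hours": 0.5}
print(provision(job, deadline_hours=6.0, max_rs=64))
```

Under this toy model the cost grows with every extra resource set (each one pays the fixed overhead), so the cheapest feasible plan is the smallest RS count that meets the deadline; a real profile with different bottlenecks per phase would change that shape.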
The document discusses how ArcGIS can be used to ingest, visualize, analyze, and share scientific data stored in formats like netCDF, HDF, and GRIB, including directly reading these files, creating multidimensional mosaics for aggregation, analyzing spatial and temporal patterns, publishing services and maps, and extending capabilities through Python tools and custom geoprocessing. ArcGIS supports the full scientific data workflow from ingesting data to sharing final results and apps on the web and with other platforms like WMS and Dapple Earth Explorer.
EDF2012 Kostas Tzoumas - Linking and analyzing big data - Stratosphere — European Data Forum
Stratosphere is a collaborative research project between universities to build an open-source platform for big data analytics. It bridges relational databases and MapReduce using a functional programming language called Meteor. The platform includes data pools, tools for data linkage and analysis, and a scalable execution engine called Nephele. Stratosphere is optimized for parallelism using its PACT programming model and optimizer. Ongoing work focuses on UDFs, caching, and advancing the MapReduce paradigm.
1) Stratosphere is a distributed data processing system that extends the MapReduce model by supporting more operators and advanced data flow graphs composed of operators.
2) It has components like a query parser, compiler, and optimizer that translate queries into execution plans composed of operators like Map, Reduce, Join, Cross, CoGroup, and Union.
3) Stratosphere supports arbitrary data flows, while MapReduce supports only the fixed map-shuffle-reduce pipeline, and Stratosphere achieves better performance through in-memory processing and pipelining, whereas MapReduce always writes intermediate results to disk.
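CoGroup, one of the operators listed above, is easy to sketch: for each key, the user function receives both groups at once, something plain MapReduce can only emulate with a tagged join. This is illustrative Python, not Stratosphere's actual API:

```python
from collections import defaultdict

def cogroup(left, right):
    """CoGroup two (key, value) datasets: for each key, collect the
    values from both sides into a pair of lists."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return dict(groups)

orders = [("alice", "book"), ("bob", "pen"), ("alice", "lamp")]
payments = [("alice", 30)]
print(cogroup(orders, payments))
# {'alice': (['book', 'lamp'], [30]), 'bob': (['pen'], [])}
```

Note that "bob" appears with an empty right-hand group, which is exactly the case (outer-join-like semantics) that makes CoGroup strictly more expressive than a key-equality Join.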
This document introduces MapReduce, including its architecture, advantages, frameworks for writing MapReduce programs, and an example WordCount MapReduce program. It also discusses how to compile, deploy, and run MapReduce programs using Hadoop and Eclipse.
The document outlines the anatomy of MapReduce applications including common phases like input splitting, mapping, shuffling, and reducing. It then provides high-level and low-level views of how a word counting MapReduce job works, explaining that it takes a text corpus as input, maps words to counts of 1, shuffles to reduce by word, and outputs final word counts. The map and reduce functions are explained at a high-level, and then implementation details like MapRunner, RecordReader, and OutputCollector are described at a lower level.
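The word-count anatomy described above can be sketched end to end in a few lines, as a single-process stand-in for the splitting, mapping, shuffling, and reducing phases:

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit (word, 1) for every word in the input split."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key before reduction."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

corpus = "the quick fox and the lazy dog"
counts = reduce_phase(shuffle(map_phase(corpus)))
print(counts["the"])   # 2
```

In real Hadoop, `map_phase` corresponds to the Mapper fed by a RecordReader, the shuffle is performed by the framework between nodes, and `reduce_phase` runs in the Reducer with an OutputCollector writing the results.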
Applying stratosphere for big data analytics — Avinash Pandu
Stratosphere is a next-generation data analytics platform that can perform complex operations like JOIN, CROSS, and GROUPS more efficiently than traditional MapReduce. It uses MapReduce as its basic building block but introduces optimizations that reduce computational time. Stratosphere supports a query language called Meteor and can execute analytical tasks formulated as Meteor queries using its distributed processing capabilities.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
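The partial-centroid optimization mentioned above can be sketched as follows: each mapper emits one (sum, count) pair per centroid instead of one record per point, which shrinks shuffle traffic, and the reducer merges those partial sums. One-dimensional points are used for brevity; this mirrors the idea, not Dumbo's API.

```python
def mapper_partial_sums(points, centroids):
    """Mapper: assign local points to the nearest centroid and emit only
    per-centroid (sum, count) partial aggregates."""
    partial = {}
    for x in points:
        cid = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        s, c = partial.get(cid, (0.0, 0))
        partial[cid] = (s + x, c + 1)
    return partial

def reducer_new_centroids(partials, k):
    """Reducer: merge partial sums and recompute each centroid mean."""
    totals = {i: (0.0, 0) for i in range(k)}
    for partial in partials:
        for cid, (s, c) in partial.items():
            ts, tc = totals[cid]
            totals[cid] = (ts + s, tc + c)
    return [ts / tc if tc else None for ts, tc in totals.values()]

splits = [[1.0, 2.0, 9.0], [1.5, 10.0, 11.0]]
partials = [mapper_partial_sums(split, [0.0, 10.0]) for split in splits]
print(reducer_new_centroids(partials, 2))   # [1.5, 10.0]
```

Because each mapper ships at most k small tuples regardless of how many points it saw, the shuffle cost drops from O(points) to O(mappers × k).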
This document describes a parallel and scalable approach called Big-SeqSB-Gen for generating large synthetic sequence databases. It implements Whitney enumerators to generate distinct sequences and uses a parallel sequence generator (PSG) built on Hadoop MapReduce. The PSG was tested on a French Grid5000 cluster and achieved generation of over 18 billion sequences in under 2 hours, demonstrating good scalability and throughput. Future work involves mining patterns from large real sequence datasets.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
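A minimal sketch of the fragmentation idea: hash-partition rows across backends, let each backend aggregate its fragment independently, and merge the partial results at a coordinator. The toy table and all names are illustrative, not the paper's schemes.

```python
def hash_partition(rows, key_fn, num_backends):
    """Fragment a table across backends by hashing a partitioning key."""
    parts = [[] for _ in range(num_backends)]
    for row in rows:
        parts[hash(key_fn(row)) % num_backends].append(row)
    return parts

def parallel_sum(parts, value_fn):
    """Each backend aggregates its fragment; the coordinator merges the
    partial results -- the same shape as a distributed SUM query."""
    partials = [sum(value_fn(r) for r in part) for part in parts]
    return sum(partials)

lineitem = [{"orderkey": i, "price": 10.0 + i} for i in range(1000)]
parts = hash_partition(lineitem, lambda r: r["orderkey"], num_backends=4)
total = parallel_sum(parts, lambda r: r["price"])
assert total == sum(r["price"] for r in lineitem)
```

Distributive aggregates like SUM and COUNT merge trivially; a query with joins is where the choice of partitioning key starts to dominate response time, which is what the TPC-H experiments probe.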
This document describes the MapReduce programming model for processing large datasets in a distributed manner. MapReduce allows users to write map and reduce functions that are automatically parallelized and run across large clusters. The input data is split and the map tasks run in parallel, producing intermediate key-value pairs. These are shuffled and input to the reduce tasks, which produce the final output. The system handles failures, scheduling and parallelization transparently, making it easy for programmers to write distributed applications.
Accumulo Summit 2015: GeoWave: Geospatial and Geotemporal Data Storage and Re... — Accumulo Summit
Talk Abstract
GeoWave is an open source software project developed at the National Geospatial-Intelligence Agency (NGA) in collaboration with Booz Allen Hamilton and RadiantBlue Technologies. GeoWave leverages Accumulo’s architecture to manage petabytes of raster and vector data by serving as an enterprise level geospatial data store. To efficiently index geospatial data and answer queries with geospatial constraints, GeoWave employs a space filling curve to form bidirectional mappings between multi-dimensional data and Accumulo’s sorted row identifiers. As a complete offering, GeoWave provides a plug-in to the Open Source Geospatial Foundation’s GeoServer platform, enabling management of geospatial data and associated attributes through Open Geospatial Consortium (OGC) standard services, and MapReduce input/output formats to support scalable post-processing and analysis of geospatial data.
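The bidirectional mapping can be sketched in both directions: encoding interleaves the bits of discretized coordinates into a sorted row ID, and decoding inverts it so that ranges of rows can be mapped back to spatial regions during query planning. This is a two-dimensional simplification with invented names; GeoWave's actual keys carry more dimensions and tiering metadata.

```python
def interleave(x, y, bits=8):
    """Encode two grid coordinates into one Z-order row ID."""
    z = 0
    for b in range(bits - 1, -1, -1):
        z = (z << 2) | (((x >> b) & 1) << 1) | ((y >> b) & 1)
    return z

def deinterleave(z, bits=8):
    """The inverse direction of the bidirectional mapping: recover the
    grid cell from a row ID."""
    x = y = 0
    for b in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * b + 1)) & 1)
        y = (y << 1) | ((z >> (2 * b)) & 1)
    return x, y

assert deinterleave(interleave(37, 201)) == (37, 201)
```

Having an exact inverse is what makes the index usable for query planning: a bounding box is translated into a small set of row-ID ranges, and candidate rows scanned from those ranges can be decoded and precisely filtered.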
Speakers
Eric Robertson
Lead Technologist, Booz Allen Hamilton
Eric Robertson is a Data Scientist at Booz Allen Hamilton and has over twenty years of experience in software development across many diverse vertical domains including telecommunication, pharmaceuticals, finance, economics and defense. Eric has extensive experience in designing and developing identity correlation systems using graph analytics. Eric holds a M.S. in Computer Science from University of Maryland Baltimore County. Eric's current interests include machine learning and linear programming.
Rich Fecher
Senior Software Engineer, RadiantBlue
Over the past 10 years, Rich Fecher has been solving the hard technical challenges that face the U.S. Defense and Intelligence Communities. Rich has extensive expertise in architecting and building end-to-end systems. His experience ranges from visualization to distributed computing, and he has primarily focused his career toward enriching geospatial content and delivery. Rich holds a M.S. in Computer Science from George Mason University; he received his post-graduate certificate in GIS from Pennsylvania State University, and received a B.S. in Computer Science with minors in Applied Math and Physics from the University of Virginia.
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ... — Accumulo Summit
Speaker: Aaron Cordova
Most users of Accumulo start developing applications on a single machine and want to scale up to four orders of magnitude more machines without having to rewrite them. In this talk we describe techniques for designing applications for scale, planning a large-scale cluster, tuning the cluster for high-speed ingest, dealing with a large amount of data over time, and unique features of Accumulo for taking advantage of up to ten thousand nodes in a single instance. We also include the largest public metrics gathered on Accumulo clusters to date and a discussion of overcoming practical limits to scaling in the future.
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
Apache Spark 2.2 shipped with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are often inherent in many real world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histogram helps Spark make better decisions in picking the most optimal query plan for real world scenarios.
In this talk, we’ll take a deep dive into how Spark’s Cost-Based Optimizer estimates the cardinality and size of each database operator. Specifically, for skewed distribution workload such as TPC-DS, we will show histogram’s impact on query plan change, hence leading to performance gain.
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
Hadoop and the term 'Big Data' go hand in hand. The information explosion brought on by cloud and distributed computing has driven the desire to process and analyze massive amounts of data, which helps organizations add value and derive useful information.
The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous. Hadoop relies on its ability to take computation to the nodes rather than migrating data among nodes, which could cause significant network overhead. This strategy has clear benefits in a homogeneous environment, but it may not be suitable in a heterogeneous one: the time taken to process data on a slower node can be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is worth studying a data placement policy that distributes data based on the processing power of each node. The project explores this data placement policy and notes the ramifications of the strategy by running a few benchmark applications.
This document provides an overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop was developed based on Google's MapReduce algorithm and how it uses HDFS for scalable storage and MapReduce as an execution engine. Key components of Hadoop architecture include HDFS for fault-tolerant storage across data nodes and the MapReduce programming model for parallel processing of data blocks. The document also gives examples of how MapReduce works and industries that use Hadoop for big data applications.
Dache is a data-aware cache system for big-data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework by provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
GeoMesa is an open-source project that provides scalable geospatial analytics on large datasets. It allows querying and analyzing data stored in Apache Accumulo using a geospatial index. GeoMesa implements the GeoTools API and supports point, line, polygon, raster, and time-enabled data through flexible space-filling curves. It enables distributed computation and analytics through features like multi-step query planning, secondary indexes, and integration with frameworks like Spark and streaming APIs. The project is developed and supported by a community including LocationTech.
This presentation will give you information about:
HDFS Overview and Architecture
1. Configuring HDFS
2. Interacting With HDFS
3. HDFS Permissions and Security
4. Additional HDFS Tasks
5. HDFS Installation
6. Hadoop File System Shell
7. File System Java API
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkSafir Shah
This document proposes Dache, a data-aware caching system for big data applications using the MapReduce framework. It aims to extend MapReduce by provisioning a cache layer to efficiently identify and access cached items. The proposed system identifies input sources and operations applied to cache items for proper indexing. It describes cache requests and replies for the map and reduce phases. Experimental results show the proposed system eliminates duplicate tasks, reduces execution time and CPU utilization compared to traditional MapReduce.
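The cache-identification idea can be sketched as follows. This is a toy illustration of keying cached results by input source plus the chain of applied operations; the names and structure are assumptions, not Dache's actual protocol:

```python
# Toy sketch of a data-aware cache: a cache item is keyed by its input
# source plus the operations applied to it, so a later job that reuses
# the same (input, operations) pair can skip the duplicate map task.

cache = {}

def cache_key(input_split: str, operations: tuple) -> tuple:
    return (input_split, operations)

def run_map_task(input_split, operations, compute):
    key = cache_key(input_split, operations)
    if key in cache:                 # cache hit: duplicate task skipped
        return cache[key], True
    result = compute(input_split)    # cache miss: do the work, store it
    cache[key] = result
    return result, False

word_count = lambda text: {w: text.split().count(w) for w in set(text.split())}

r1, hit1 = run_map_task("split-0: a b a", ("tokenize", "count"), word_count)
r2, hit2 = run_map_task("split-0: a b a", ("tokenize", "count"), word_count)
assert not hit1 and hit2 and r1 == r2   # identical second task hits the cache
```

Recording the operation chain in the key is what makes the cache "data aware": the same input split processed by a different map function must not reuse the cached result.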
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run on cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately relevant to MapReduce-based applications. Provisioning a MapReduce job entails requesting the optimum number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs to pick a different combination of these parameters based on the job profile such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
The document discusses how ArcGIS can be used to ingest, visualize, analyze, and share scientific data stored in formats like netCDF, HDF, and GRIB, including directly reading these files, creating multidimensional mosaics for aggregation, analyzing spatial and temporal patterns, publishing services and maps, and extending capabilities through Python tools and custom geoprocessing. ArcGIS supports the full scientific data workflow from ingesting data to sharing final results and apps on the web and with other platforms like WMS and Dapple Earth Explorer.
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - StratosphereEuropean Data Forum
Stratosphere is a collaborative research project between universities to build an open-source platform for big data analytics. It bridges relational databases and MapReduce using a functional programming language called Meteor. The platform includes data pools, tools for data linkage and analysis, and a scalable execution engine called Nephele. Stratosphere is optimized for parallelism using its PACT programming model and optimizer. Ongoing work focuses on UDFs, caching, and advancing the MapReduce paradigm.
1) Stratosphere is a distributed data processing system that extends the MapReduce model by supporting more operators and advanced data flow graphs composed of operators.
2) It has components like a query parser, compiler, and optimizer that translate queries into execution plans composed of operators like Map, Reduce, Join, Cross, CoGroup, and Union.
3) Stratosphere supports arbitrary data flows while MapReduce only supports MapReduce, and Stratosphere has better performance through in-memory processing and pipelining compared to MapReduce which always writes to disk.
This document introduces MapReduce, including its architecture, advantages, frameworks for writing MapReduce programs, and an example WordCount MapReduce program. It also discusses how to compile, deploy, and run MapReduce programs using Hadoop and Eclipse.
The document outlines the anatomy of MapReduce applications, including common phases like input splitting, mapping, shuffling, and reducing. It then provides high-level and low-level views of how a word-counting MapReduce job works, explaining that it takes a text corpus as input, maps each word to a count of 1, shuffles to group counts by word, and outputs final word counts. The map and reduce functions are explained at a high level, and then implementation details like MapRunner, RecordReader, and OutputCollector are described at a lower level.
Applying stratosphere for big data analyticsAvinash Pandu
Stratosphere is a next-generation data analytics platform that can perform complex operations like JOIN, CROSS, and GROUPS more efficiently than traditional MapReduce. It uses MapReduce as its basic building block but introduces optimizations that reduce computational time. Stratosphere supports a query language called Meteor and can execute analytical tasks formulated as Meteor queries using its distributed processing capabilities.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
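The partial-centroid optimization described above can be sketched without any Hadoop machinery: each mapper emits per-cluster sums and counts instead of raw points, and the reducer averages the partials. This is a pure-Python illustration; Dumbo's actual API is not shown:

```python
# Combiner-style K-means: mappers emit per-cluster (sum, count) partials
# instead of raw points, so the reducer averages a handful of partials
# rather than shuffling every point across the network.

def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def mapper_partials(points, centroids):
    """One mapper's output: cluster id -> (sum_vector, count)."""
    partials = {}
    for pt in points:
        k = nearest(pt, centroids)
        s, n = partials.get(k, ((0.0,) * len(pt), 0))
        partials[k] = (tuple(a + b for a, b in zip(s, pt)), n + 1)
    return partials

def reduce_centroids(all_partials, num_clusters, dim):
    sums = {k: [(0.0,) * dim, 0] for k in range(num_clusters)}
    for partials in all_partials:
        for k, (s, n) in partials.items():
            sums[k][0] = tuple(a + b for a, b in zip(sums[k][0], s))
            sums[k][1] += n
    return [tuple(x / n for x in s) if n else None for s, n in sums.values()]

centroids = [(0.0, 0.0), (10.0, 10.0)]
split1 = [(1.0, 1.0), (2.0, 0.0)]          # one input split per mapper
split2 = [(9.0, 11.0), (11.0, 9.0)]
partials = [mapper_partials(s, centroids) for s in (split1, split2)]
new_centroids = reduce_centroids(partials, 2, 2)
print(new_centroids)   # [(1.5, 0.5), (10.0, 10.0)]
```

The shuffle volume drops from one record per point to at most one record per cluster per mapper, which is the optimization the document credits for the speedup.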
This document describes a parallel and scalable approach called Big-SeqSB-Gen for generating large synthetic sequence databases. It implements Whitney enumerators to generate distinct sequences and uses a parallel sequence generator (PSG) built on Hadoop MapReduce. The PSG was tested on a French Grid5000 cluster and achieved generation of over 18 billion sequences in under 2 hours, demonstrating good scalability and throughput. Future work involves mining patterns from large real sequence datasets.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
This document describes the MapReduce programming model for processing large datasets in a distributed manner. MapReduce allows users to write map and reduce functions that are automatically parallelized and run across large clusters. The input data is split and the map tasks run in parallel, producing intermediate key-value pairs. These are shuffled and input to the reduce tasks, which produce the final output. The system handles failures, scheduling and parallelization transparently, making it easy for programmers to write distributed applications.
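The map/shuffle/reduce dataflow described above can be simulated in a single process. This sketch is illustrative only; in a real cluster the map calls run in parallel over input splits and the shuffle moves data across the network, but the dataflow is the same:

```python
# Single-process simulation of the MapReduce word-count pipeline:
# map emits (word, 1) pairs, shuffle groups pairs by key, reduce sums.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)                 # intermediate key-value pair

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:                # group values by key, as the
        groups[key].append(value)           # framework does between phases
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

corpus = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(corpus)))
print(counts["the"])   # 3
print(counts["fox"])   # 2
```

The framework's value is that everything outside the two user functions (splitting, grouping, retrying failed tasks) is handled transparently at cluster scale.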
Accumulo Summit 2015: GeoWave: Geospatial and Geotemporal Data Storage and Re...Accumulo Summit
Talk Abstract
GeoWave is an open source software project developed at the National Geospatial-Intelligence Agency (NGA) in collaboration with Booz Allen Hamilton and RadiantBlue Technologies. GeoWave leverages Accumulo’s architecture to manage petabytes of raster and vector data by serving as an enterprise level geospatial data store. To efficiently index geospatial data and answer queries with geospatial constraints, GeoWave employs a space filling curve to form bidirectional mappings between multi-dimensional data and Accumulo’s sorted row identifiers. As a complete offering, GeoWave provides a plug-in to the Open Source Geospatial Foundation’s GeoServer platform, enabling management of geospatial data and associated attributes through Open Geospatial Consortium (OGC) standard services, and map-reduce input/output formats to support scalable post-processing and analysis of geospatial data.
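The bidirectional mapping between multi-dimensional data and sorted row identifiers can be illustrated with a minimal Z-order (Morton) curve. GeoWave supports more sophisticated curves, so treat this toy version as a sketch of the idea rather than GeoWave's implementation:

```python
# Minimal Z-order (Morton) curve: interleave the bits of two dimensions
# into one sortable integer, and invert the mapping.

def interleave(x, y, bits=16):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits -> even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits -> odd positions
    return z

def deinterleave(z, bits=16):
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y

# Nearby cells get nearby curve values, so a spatial box query becomes
# a small set of contiguous row ranges in the sorted key space.
z = interleave(5, 9)
assert deinterleave(z) == (5, 9)   # the mapping is bidirectional
```

The bidirectionality matters: encoding turns a coordinate into a sorted row identifier at ingest time, and decoding lets the query planner translate a geospatial constraint back into row ranges.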
Speakers
Eric Robertson
Lead Technologist, Booz Allen Hamilton
Eric Robertson is a Data Scientist at Booz Allen Hamilton and has over twenty years of experience in software development across many diverse vertical domains including telecommunication, pharmaceuticals, finance, economics and defense. Eric has extensive experience in designing and developing identity correlation systems using graph analytics. Eric holds a M.S. in Computer Science from University of Maryland Baltimore County. Eric's current interests include machine learning and linear programming.
Rich Fecher
Senior Software Engineer, RadiantBlue
Over the past 10 years, Rich Fecher has been solving the hard technical challenges that face the U.S. Defense and Intelligence Communities. Rich has extensive expertise in architecting and building end-to-end systems. His experience ranges from visualization to distributed computing, and he has primarily focused his career toward enriching geospatial content and delivery. Rich holds a M.S. in Computer Science from George Mason University; he received his post-graduate certificate in GIS from Pennsylvania State University, and received a B.S. in Computer Science with minors in Applied Math and Physics from the University of Virginia.
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit
Speaker: Aaron Cordova
Most users of Accumulo start developing applications on a single machine and want to scale up to four orders of magnitude more machines without having to rewrite them. In this talk we describe techniques for designing applications for scale, planning a large scale cluster, tuning the cluster for high speed ingest, dealing with a large amount of data over time, and unique features of Accumulo for taking advantage of up to ten thousand nodes in a single instance. We also include the largest public metrics gathered on Accumulo clusters to date and include a discussion of overcoming practical limits to scaling in the future.
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit
Talk Abstract
Having the ability to diagnose and understand what is happening in distributed systems is essential. Tracing is one mechanism that enables analysis of operations in distributed systems by dividing each operation into a tree of measurable sub-tasks. HDFS, Accumulo, and HBase are now converging on a single tracing system utilizing HTrace, an open source tracing instrumentation library that recently became a new Apache Incubator project. This talk will cover tracing fundamentals, the instrumentation that has been added to HDFS to support tracing, and changes that have been made in Accumulo's tracing. It will also cover options for collecting and visualizing traces, as well as the current status of the HTrace podling.
Speaker
Billie Rinaldi
Sr. Member of Technical Staff, Hortonworks
Billie Rinaldi is a Senior Member of Technical Staff at Hortonworks, Inc., currently prototyping new features related to application monitoring and deployment in the Apache Hadoop ecosystem. Prior to August 2012, Billie engaged in big data science and research at the National Security Agency. Since 2008, she has been providing technical leadership regarding the software that is now Apache Accumulo. Billie is the VP of Apache Accumulo, the Accumulo Project Management Committee Chair, and a member of the Apache Software Foundation. She holds a Ph.D. in applied mathematics from Rensselaer Polytechnic Institute.
The document discusses Apache Accumulo, an open source distributed key-value store based on Google's Bigtable design. It provides an overview of Accumulo, including its timeline, strengths in security, scalability and adaptability. It describes Accumulo's basic schema of sorted key-value pairs with row, column family, qualifier, visibility and timestamp. It also outlines Accumulo's architecture, tablet organization, data flow, iterator framework and table design strategies.
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit
Many organizations are looking to Hadoop clusters in order to store and manage an ever-increasing amount of data. As the volume and variety of data in these systems grows, administrators are being confronted with more information, from more sources, than they have ever seen concentrated in a single place. The responsibility for securing all this data can be daunting to an administrator, even intimidating. Could the answer lie in Accumulo?
Conventional approaches to data security usually do not suffice for this scenario. They are often coarse-grained, applying only at the file or table level. In a world where arbitrary compute tasks can be pushed into the cluster, defining a security perimeter is difficult or impossible. On the other hand, relegating access policy enforcement to the application level instead of the database level ultimately invites a security disaster.
This is the world that Chief Security Officers, Chief Information Officers, and Chief Data Officers live in, and the problem of security for big data is the single biggest impediment to delivering a Hadoop-based solution in the enterprise’s production network. Numerous organizations have implemented Hadoop as a pilot, but find themselves blocked by similar considerations when the time comes to move into production:
• How do you implement fine-grained access controls in a Hadoop system?
• What about encryption at rest? Encryption in motion?
• How will this tie into our identity infrastructure?
• How will this fit into our operational workflow?
This keynote will explore the ways in which Apache Accumulo is uniquely positioned to mitigate or resolve problems around access control and security for big data, thus enabling Hadoop clusters to move from pilot to production.
– Speaker –
Russ Weeks
Software Architect, PHEMI Systems
Russ Weeks is a Software Architect at PHEMI. Prior to joining PHEMI Systems, Russ worked in the network management groups at Ericsson and Cray Supercomputers, where he discovered a passion for distributed data structures and algorithms. PHEMI Systems is a Vancouver, BC-based startup focused on the storage, retention and governance of structured and unstructured data.
— More Information —
For more information see http://www.accumulosummit.com/
Aaron Cordova outlines how Accumulo helps provide the essential features of a "Data Lake": a system in which all types of data from all sources can be imported, secured, analyzed, and delivered to decision makers.
The document summarizes how Accumulo can scale to support large clusters storing petabytes of data. It discusses how Accumulo maintains low administrative effort and scan latency as the data size scales up. Key techniques for scaling Accumulo include distributing writes across all servers, designing schemas to minimize the number of scans needed, and using temporal or binned keys to parallelize writes. The document also provides estimates for planning Accumulo clusters capable of ingesting millions of entries per second and storing data in the petabyte range.
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit
Speaker: Mike Drob
Apache Accumulo has long held a reputation for enabling high-throughput operations in write-heavy workloads. In this talk, we use the Yahoo! Cloud Serving Benchmark (YCSB) to put real numbers on Accumulo performance. We then compare these numbers to previous versions, to other databases, and wrap up with a discussion of parameters that can be tweaked to improve them.
Accumulo is a distributed key-value store that runs on Hadoop clusters. It is very scalable, able to store trillions of records and petabytes of data. Accumulo provides cell-level security and was originally developed by NSA as an open source version of Google's BigTable. It uses a master node to monitor tablet servers that store and serve partitions of tables. Potential applications of Accumulo include use as a massive datastore, for graph databases, machine learning/classification using sparse feature vectors.
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit
Accumulo requires its users to trust each Accumulo installation with their data — a malicious server or user could easily compromise critical data or learn secrets they are not authorized to access. One particular threat is a malicious Accumulo server compromising data’s integrity, by tampering with query results and returning forged, modified, or incomplete results to a user. In prior work, we implemented a lightweight client-side tool to protect against this kind of threat. We now present improvements to this tool that handle a wider range of attacks by a malicious server and reduce overhead for the client.
In our solution, Accumulo clients use Authenticated Data Structures (ADSs) to verify their range queries’ integrity. ADS metadata is stored in Accumulo, so that after each query, the server must construct a proof that the query has not been tampered with. We use Accumulo iterators to compute these proofs on the server without requiring an unnecessary computational burden from the client. We will present our approach to adding ADSs to Accumulo, our schema for storing the ADS metadata, and opportunities for future work in efficiency and expressiveness.
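The proof mechanism can be sketched with a toy Merkle tree, a classic authenticated data structure: the client stores only the root hash, and with each entry the server must return a sibling-hash path that recomputes that root. The layout below is illustrative, not the tool's actual schema:

```python
# Toy Merkle tree: a tampered or forged entry cannot produce a
# sibling-hash path that recomputes the root hash the client holds.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """Return all tree levels, leaves first; duplicate last node if odd."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def proof_path(levels, index):
    """Sibling hashes from leaf to root for the leaf at `index`."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[index ^ 1], index % 2 == 0))  # sibling of i is i^1
        index //= 2
    return path

def verify(leaf, path, root):
    node = h(leaf)
    for sibling, leaf_is_left in path:
        node = h(node + sibling) if leaf_is_left else h(sibling + node)
    return node == root

entries = [b"row1:val1", b"row2:val2", b"row3:val3", b"row4:val4"]
levels = build_levels(entries)
root = levels[-1][0]                      # this is all the client stores
assert verify(entries[2], proof_path(levels, 2), root)
assert not verify(b"row2:forged", proof_path(levels, 1), root)
```

Proving completeness of a range query additionally requires the tree to be ordered by key, so the client can check that no entries between the returned boundaries were omitted.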
– Speaker –
Leo St. Amour
Military Fellow, MIT Lincoln Laboratory
Leo St. Amour is a master’s student at Northeastern University and a military fellow at MIT Lincoln Laboratory. He graduated from the United States Military Academy in May 2015, where he worked on a TLS library with enhanced usability and security. In addition to his work on TLS and Accumulo, he is currently working on binary analysis, with a focus on discovering and hardening security properties.
— More Information —
For more information see http://www.accumulosummit.com/
Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]Accumulo Summit
Talk Abstract
Bulk ingest enables Accumulo to import externally-prepared data into existing tables. Unlike ingest via batch writers, much of the work of organizing data can be left to external processing frameworks such as MapReduce and scaled independently of the Accumulo cluster itself. This reduces the work required of the tablet servers to support ingest, freeing resources to support other operations.
Under the hood, bulk ingest involves a number of moving parts and must account for a variety of failure scenarios. This talk covers the components of the bulk ingest process in-depth and describes past, current and future implementations of this capability. Attendees will leave this session with an understanding of bulk ingest that will enable troubleshooting, capacity estimation and performance management.
Speaker
Eric Newton
Senior Software Developer, SWComplete
Eric Newton has been a programmer for over 30 years, and has worked on Accumulo since 2009. He has been an open-source contributor and consumer since 1988. Through the years, his distributed communications systems work has included Air Traffic Control, Systems Monitoring and Databases. Eric has started 3 of his own companies and helped several other businesses start.
GeoMesa's index uses a shard id at the beginning of the key to allow for horizontal scalability. It encodes spatio-temporal data in Accumulo keys using space-filling curves. Queries are applied in parallel at scan time through stacked server-side iterators to implement (E)CQL standard queries.
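The key scheme above can be sketched as follows; the exact byte layout here is an assumption for illustration, not GeoMesa's actual binary format:

```python
# Illustrative sharded spatio-temporal row key: a leading shard id
# spreads writes across tablet servers, and a space-filling-curve value
# keeps spatially nearby records in nearby rows within each shard.
import struct

NUM_SHARDS = 4

def row_key(feature_id: str, curve_value: int) -> bytes:
    shard = hash(feature_id) % NUM_SHARDS          # hash-based shard id
    return (struct.pack(">B", shard)               # 1-byte shard prefix
            + struct.pack(">Q", curve_value)       # 8-byte curve index
            + feature_id.encode())                 # unique suffix

def scan_ranges(curve_lo: int, curve_hi: int):
    """A spatial query becomes one contiguous range per shard."""
    return [(struct.pack(">B", s) + struct.pack(">Q", curve_lo),
             struct.pack(">B", s) + struct.pack(">Q", curve_hi + 1))
            for s in range(NUM_SHARDS)]

key = row_key("track-42", 123456)
ranges = scan_ranges(100000, 200000)
assert len(ranges) == NUM_SHARDS               # queries fan out per shard
assert any(lo <= key < hi for lo, hi in ranges)
```

The trade-off is explicit: sharding parallelizes both writes and scans across tablet servers, at the cost of one scan range per shard for every query.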
Processing Geospatial Data At Scale @locationtechRob Emanuele
This document discusses processing large geospatial data at scale. It provides background on big data frameworks like Apache Hadoop, Apache Spark, and geospatial projects like GeoTrellis, GeoWave, and SpatialHadoop that enable processing geospatial data using these frameworks. The document outlines how these tools allow geospatial data from sources like satellite imagery, OpenStreetMap, and geotagged social media to be analyzed using distributed computing platforms and algorithms.
We have two great organisations hosting FOSS4G this year: The Open Source Geospatial Foundation and LocationTech. Putting on a great event is not the primary responsibility of these software foundations - supporting our great open source software is!
This talk will introduce OSGeo and LocationTech, and balance the tricky topic of comparison for those interested in what each organisation offers and identifying possibilities for collaboration.
Each of these software foundations has an “incubation” process set up to onboard new projects. This incubation process matches the organization's priorities and will address many factors important to you, and a few ideas you may not have considered yet.
This talk draws on the incubation experience of:
* GeoServer (OSGeo), GeoTools (OSGeo),
* GeoGig (LocationTech), uDig (LocationTech)
If you are an open source developer interested in joining a foundation, we will cover some of the resource, marketing and infrastructure benefits that may be a factor for consideration. We will also look into some of the long-term benefits a software foundation provides both you and, importantly, the users of your software.
If you are a team member faced with the difficult choice of selecting open source technologies, this talk can help. We can learn a lot about the risks associated with open source based on how each foundation seeks to protect you. The factors a software foundation considers for its projects provide useful criteria you can use to evaluate any project.
Sqrrl Data, Inc. is a startup company founded in July 2012 that is focused on building secure, scalable, and adaptive applications using Apache Accumulo. The company was founded by former engineers and contributors to Accumulo, including the former Tech Director of Accumulo at NSA. Sqrrl aims to develop lightweight applications for discovery analytics, targeted analysis, and big-picture analytics using Accumulo's capabilities for security, scalability, and flexibility.
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit
Talk Abstract
As with all open-source databases, Accumulo developers often balance building exciting new features against hacking on performance and stability. As the core features solidify and expand, we see many opportunities to improve performance. An effective methodology for performance improvement is scientific in nature, and follows a well-defined modeling and simulation approach, matching theory to experimentation in an iterative fashion.
Ingest performance is one of the most differentiating characteristics of Accumulo. However, there is much room for improvement for typical ingest-heavy applications. Accumulo supports two mechanisms to bring data in: streaming ingest and bulk ingest. In bulk ingest, the goal is to maximize throughput without constraining latency. Bulk ingest involves creating a set of files that conform to Accumulo's internal RFile format and then registering those files with Accumulo. MapReduce provides a framework for generating, sorting, and storing key/value pairs, which form the primary elements of preparing RFiles for bulk ingest. MapReduce has been used many times over the years to break sorting records, such as Terasort, so we can expect it to be a reasonable choice for maximizing bulk ingest throughput. However, the theory often proves challenging to implement, as there are many performance pitfalls along the way.
In this talk, we dive deep into optimizing MapReduce for Accumulo bulk ingest. We share detailed theoretical and empirical performance models, we discuss techniques for profiling performance, and we suggest reusable techniques for squeezing the maximum performance out of enterprise-grade Accumulo bulk ingest.
Speaker
Chris McCubbin
Director of Data Science, Sqrrl
Chris is the Director of Data Science for Sqrrl. He has extensive experience with the Hadoop ecosystem and applying scientific computation algorithms to real-world datasets. Previously, Chris developed Big Data analysis tools for the Intelligence Community and applied artificial intelligence techniques to unmanned vehicle systems. He holds a MS in Computer Science and BS in Computer Science and Mathematics from the University of Maryland.
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit
Talk Abstract
In this talk we will walk through how Apache Kafka and Apache Accumulo can be used together to orchestrate a de-coupled, real-time distributed and reactive request/response system at massive scale. Multiple data pipelines can perform complex operations for each message in parallel at high volumes with low latencies. The final result will be in line with the initiating call. The architecture gains are immense. They allow for the requesting system to receive a response without the need for direct integration with the data pipeline(s) that messages must go through. By utilizing Apache Kafka and Apache Accumulo, these gains sustain at scale and allow for complex operations of different messages to be applied to each response in real-time.
Speaker
Joe Stein
Principal Consultant, Big Data Open Source Security, LLC
Joe Stein is an Apache Kafka committer and PMC member. Joe is the Founder and Principal Architect of Big Data Open Source Security LLC, a professional services and product solutions company. Joe has been a developer, architect and technologist professionally for 15 years, having built back-end systems that supported over one hundred million unique devices a day processing trillions of events. He blogs and hosts a podcast about Hadoop and related systems at All Things Hadoop and tweets @allthingshadoop.
Processing Geospatial at Scale at LocationTechRob Emanuele
This document discusses processing large geospatial data at scale. It provides background on geospatial concepts like raster and vector data. It then discusses big data frameworks like Hadoop, Spark, and Accumulo that can be used to process geospatial data in parallel across large clusters. Finally, it presents several LocationTech projects like GeoTrellis, GeoJinni, and GeoWave that build geospatial capabilities on top of these frameworks to allow distributed processing and querying of large raster and vector maps.
Apache Accumulo, originally developed by the National Security Agency and now an Apache Software Foundation project, builds upon Google's Bigtable design to provide a scalable, lightly-structured database capability complementing the ubiquitous Hadoop environment. The core capabilities of Accumulo include cell-level security, flexible schemas, real-time analytics, bulk I/O, and linear scalability beyond trillions of entries and petabytes of data. These new capabilities lead to techniques that unlock the power of Big Data, but don't fit into traditional database design patterns. Learn about the advantages of Apache Accumulo and how it fits into the Hadoop and NoSQL ecosystem.
Presenter: Adam Fuchs, CTO, sqrrl
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal Processing
1. GeoMesa: Using
Accumulo for optimized
spatio-temporal
processing
Dr. James Hughes, CCRi
james.hughes@ccri.com
2. GeoMesa is
● A collection of libraries and modules which can be used to
solve Big Geo Data problems
○ Great for managing billions to trillions of vector data records
○ Great for streaming vector data
● Open sourced through Eclipse’s LocationTech working group and has
graduated incubation
● Built on top of great open source libraries
GeoMesa Background
3. Such architectures allow for live views and near-real time processing (speed layer)
while persisting the data for historic queries and batch analysis (batch layer).
Client access to both layers can be handled by GeoServer.
GeoMesa enables Lambda architectures
4. Suppose we wish to monitor and understand a group of GPS-enabled and
internet-enabled devices (ex: sensors, vehicles).
● GeoMesa’s ETL / converter library aids in re-usable data modeling.
● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest
into Accumulo and Kafka topics.
● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as
1) geo-fencing, 2) location trackers, and 3) complex alerting rules.
● Effective storage in Accumulo allows for fast query returns.
● End-to-end visualization and analysis support allows aggregations to be
pushed down to the Accumulo tablet servers.
● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc
interactive analysis and data discovery.
Example Use Case: Managing Internet-Aware Devices
5. Suppose we wish to monitor and understand a group of GPS-enabled and
internet-enabled devices (ex: sensors, vehicles).
● GeoMesa’s ETL / converter library aids in re-usable data modeling.
● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest
into Accumulo and Kafka topics.
● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as
1) geo-fencing, 2) location trackers, and 3) complex alerting rules.
● Effective storage in Accumulo allows for fast query returns.
● End-to-end visualization and analysis support allows aggregations to be
pushed down to the Accumulo tablet servers.
● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc
interactive analysis and data discovery.
All of this adds up to “Speed! Speed! Speed!” whether you are looking at
a live view of the data or pulling back an analysis product.
Example Use Case: Managing Internet-Aware Devices
6. Making visualization and analysis fast has been a journey, and this
talk is about our steps so far
Talk Outline
7. Making visualization and analysis fast has been a journey, and this
talk is about our steps so far
1. Space-filling curves and storing spatio-temporal data
2. Improvements to GeoMesa use and implementation of Accumulo Iterators
3. Spark and MapReduce for distributed computation
Talk Outline
8. Making visualization and analysis fast has been a journey, and this
talk is about our steps so far
1. Space-filling curves and storing spatio-temporal data
2. Improvements to GeoMesa use and implementation of Accumulo Iterators
3. Spark and MapReduce for distributed computation
Not in this talk
1. Storm / NiFi - Streaming Ingest
2. Live views and online processing with Kafka
3. Command line tools
4. ETL / parser library
5. Machine learning / Deep Analytics
Talk Outline
9. ● Accumulo Key Design
● Space Filling Curves 101
● Indices for Points with Time
● Indices for Lines and Polygons
● Lessons Learned
GeoMesa's
evolution of
Accumulo
schemas
10. In a traditional stack, the application
issues queries to a database which is
responsible for query planning.
Overview of query planning in Accumulo
11. In a traditional stack, the application
issues queries to a database which is
responsible for query planning.
Overview of query planning in Accumulo
With Accumulo, the query planning is
handled by library code in the
application.
12. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
Space Filling Curves (in one slide!)
13. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
Space Filling Curves (in one slide!)
14. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
that the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the
label.
Space Filling Curves (in one slide!)
15. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
that the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the
label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
Space Filling Curves (in one slide!)
16. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
that the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the
label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
● Space filling curves have higher
dimensional analogs.
Space Filling Curves (in one slide!)
17. To query for points in the grey rectangle, the
query planner enumerates a collection of index
ranges which cover the area.
Note: Most queries won’t line up perfectly with the
gridding strategy.
Further filtering can be run on the Accumulo
tablet servers with Iterators (next section)
or we can return ‘loose’ bounding box results
(likely more quickly).
Query planning with Space Filling Curves
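The range enumeration can be illustrated with a brute-force toy planner over the Z-order sketch above. (Real planners, like GeoMesa's via sfcurve, recurse over the curve instead of visiting every cell, but they produce the same kind of range list.)

```python
# Brute-force "query planning": collect the curve label of every cell
# inside the query rectangle, then merge consecutive labels into
# contiguous scan ranges.
def z_index(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def covering_ranges(x0, y0, x1, y1):
    zs = sorted(z_index(x, y)
                for x in range(x0, x1 + 1)
                for y in range(y0, y1 + 1))
    ranges, start, prev = [], zs[0], zs[0]
    for z in zs[1:]:
        if z != prev + 1:                 # gap: the curve left the box
            ranges.append((start, prev))
            start = z
        prev = z
    ranges.append((start, prev))
    return ranges
```

A square aligned with the gridding yields one range; a box the curve exits and re-enters yields several, which is exactly why further filtering (or 'loose' results) is needed.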
18. GeoMesa has several tables; each optimized for a particular use case.
The Z3 table is used with and optimized for spatio-temporal point data. (Think sensor
observations, track reports, or other events which happen at a particular location.)
GeoMesa Key Structure for the ‘Z3’ table
Key:
Row = Shard (1 byte) + Epoch Week (2 bytes) + Z3(x,y,t) (8 bytes)
Column Family = 'F'
Value = the record
Here and now:
(38.9864985, -76.9561856)
10:15am, Tuesday, Oct. 11th, 2016
Epoch Week: 2440
X value: 1275689
Y value: 151972
T value: 2097151
Z3 (as a long):
6430470637115132837
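The pieces above can be packed into a sortable row key; here is a sketch in Python using the field widths from the slide (this shows the layout idea, not GeoMesa's exact wire format):

```python
# Illustrative packing of the Z3 row-key pieces:
# shard (1 byte) + epoch week (2 bytes) + Z3(x, y, t) (8 bytes).
# Big-endian packing makes lexicographic byte order match numeric order,
# which is what makes Accumulo range scans over the curve work.
import struct
from datetime import date

def epoch_week(d):
    return (d - date(1970, 1, 1)).days // 7

def z3_row_key(shard, week, z3):
    return struct.pack('>BHQ', shard, week, z3)

# The slide's example date falls in epoch week 2440:
key = z3_row_key(0, epoch_week(date(2016, 10, 11)), 6430470637115132837)
```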
19. Most approaches to indexing non-point
geometries involve covering the
geometry with a number of grid cells
and storing a copy with each index.
This means that the client has to
deduplicate results, which is expensive.
Indexing non-point geometries: New XZ Index
20. Most approaches to indexing non-point
geometries involve covering the
geometry with a number of grid cells
and storing a copy with each index.
This means that the client has to
deduplicate results, which is expensive.
Böhm, Klump, and Kriegel describe an
indexing strategy that allows such
geometries to be stored once.
GeoMesa has implemented this
strategy in XZ2 (spatial-only) and XZ3
(spatio-temporal) tables.
The key is to store data by resolution,
separate geometries by size, and then
index them by their lower left corner.
This does require consideration on the
query planning side, but avoiding
deduplication is worth the trade-off.
Indexing non-point geometries: New XZ Index
For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial
extension.” 6th. Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China.
(http://www.dbs.ifi.lmu.de/Publikationen/Boehm/Ordering_99.pdf)
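The size-to-resolution step can be sketched in a toy form: pick the finest grid level whose cells are still at least as large as the geometry's extent, so the geometry can be indexed once by the cell holding its lower-left corner. (The cell "enlargement" and actual curve encoding from Böhm et al. are omitted; the function name and clamping below are illustrative.)

```python
# Toy size-to-resolution selection for an XZ-style index.
import math

def xz_level(width, height, max_level=12):
    extent = max(width, height)   # geometry extent, normalized to a unit square
    if extent <= 0:
        return max_level
    # finest level whose cells (side 2**-level) still cover the extent
    level = int(math.floor(math.log2(1.0 / extent)))
    return max(0, min(level, max_level))
```

Large geometries land at coarse levels and small ones at fine levels, which is how the scheme separates geometries by size without storing copies.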
21. ● Accumulo Iterator Overview
● GeoMesa Iterators for Analysis
and Visualization
● Iterator Lessons Learned
GeoMesa's use
of Accumulo
Iterators
22. “Iterators provide a modular mechanism for adding functionality to be executed by
TabletServers when scanning or compacting data. This allows users to efficiently
summarize, filter, and aggregate data.” -- Accumulo 1.7 documentation
Part of the modularity is that iterators can be stacked:
the output of one can be wired into the next.
Example: The first iterator might read from disk, the second could filter with
Authorizations, and a final iterator could filter by column family.
Other notes:
● Iterators provide a sorted view of the key/values.
● Iterator code can be loaded from HDFS and namespaced!
Accumulo Iterators
24. Visualization Example: Heatmaps
Without powerful visualization options,
big data is big nonsense.
Consider this view of shipping in the
Mediterranean Sea.
Heatmaps help show patterns and
they can be accelerated with
GeoMesa
25. Visualization Example: Heatmaps
Without powerful visualization options,
big data is big nonsense.
Consider this view of shipping in the
Mediterranean Sea.
Heatmaps help show patterns and
they can be accelerated with
GeoMesa
Heatmap
Request
HeatMap WPS
Query Hints
26. A request to GeoMesa consists of two broad pieces:
1. A filter restricting the data to act on, e.g.:
a. Records in Maryland with ‘Accumulo’ in the text field.
b. Records during the first week of 2016.
2. A request for ‘how’ to return the data, e.g.:
a. Return the full records
b. Return a subset of the record (either a projection or ‘bin’ file format)
c. Return a histogram
d. Return a heatmap / kernel density
Generally, a filter can be handled partially by selecting which ranges to scan; the
remainder can be handled by an Iterator.
Modifications to selected data can also be handled by a GeoMesa Iterator.
GeoMesa Data Requests
27. The first pass of GeoMesa iterators separated concerns into separate iterators.
The GeoMesa query planner assembled a stack of iterators to achieve the desired
result.
Initial GeoMesa Iterator design
Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by
Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon
28. The key benefit to having decomposed iterators is that they are easier to
understand and re-mix.
In terms of performance, each one needs to understand the bytes in the Key and
Value. In many cases, this will lead to additional serialization/deserialization.
Now, we prefer to write Iterators which handle transforming the underlying data
into what the client code is expecting in one go.
Second GeoMesa Iterator design
29. 1. Using fewer iterators in the stack can be beneficial
2. Using lazy evaluation / deserialization when filtering Values can yield speed
improvements.
3. Iterators take in Sorted Keys + Values and *must* produce Sorted Keys and
Values.
4. Accumulo 1.8.0 has an Iterator Test Harness!
https://accumulo.apache.org/release_notes/1.8.0#iterator-test-harness
https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterator_testing
Lessons learned about Iterators
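Lesson 2 can be sketched concretely: if the Value carries a small offset table, a server-side filter can decode just the one attribute it tests instead of the whole record. The encoding below is invented for illustration; GeoMesa's actual (Kryo-based) serialization differs.

```python
# Toy length/offset-table encoding enabling lazy attribute access.
import struct

def encode(attrs):
    # header: attribute count (2 bytes) + one 2-byte offset per attribute
    offsets, body = [], b''
    for a in attrs:
        offsets.append(len(body))
        body += a
    header = struct.pack('>H', len(attrs)) + b''.join(
        struct.pack('>H', o) for o in offsets)
    return header + body

def lazy_attr(value, i):
    # decode only attribute i, without touching the others
    (n,) = struct.unpack_from('>H', value, 0)
    start = 2 + 2 * n                         # end of the offset table
    (off,) = struct.unpack_from('>H', value, 2 + 2 * i)
    if i + 1 < n:
        (end,) = struct.unpack_from('>H', value, 2 + 2 * (i + 1))
    else:
        end = len(value) - start
    return value[start + off:start + end]
```

A filter like `Who = 'Bierce'` then touches one slice of the Value per entry rather than fully deserializing every record it rejects.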
30. Through our use of a) space filling curves, b) a cost-based query optimizer, and
c) carefully configured iterators, the GeoMesa query planner has a lot going on.
The GeoMesa query explainer logs 1) which index was used, 2) which ranges
were scanned, 3) Iterator configuration, etc.
Putting it all together: the GeoMesa Query Explainer
geomesa> geomesa explain -u USER -p PASS -i INSTANCE -c geomesa -z zoo1,zoo2,zoo3 -f AccumuloQuickStart -q "Who =
'Bierce'"
Planning 'AccumuloQuickStart' Who = 'Bierce'
Original filter: Who = 'Bierce'
Hints: density[false] bin[false] stats[false] map-aggregate[false] sampling[none]
Sort: none
Transforms: None
Strategy selection:
Query processing took 69ms and produced 1 options
Filter plan: FilterPlan[ATTRIBUTE[Who = 'Bierce'][None]]
Strategy selection took 8ms for 1 options
Strategy 1 of 1: AttributeIdxStrategy
Strategy filter: ATTRIBUTE[Who = 'Bierce'][None]
Plan: org.locationtech.geomesa.accumulo.index.BatchScanPlan
Table: geomesa_attr
Deduplicate: false
Column Families: all
Ranges (1): [%01;%00;%00;Bierce%00;::%01;%00;%00;Bierce%01;)
Iterators (0):
Query planning took 119ms
Verify hints
Inspect strategies considered
See table and ranges to be scanned
Quantify planning time
31. ● GeoMesa + Spark Setup
● GeoMesa + Spark Analytics
● GeoMesa powered notebooks
(Jupyter and Zeppelin)
GeoMesa’s
Spark Support:
Data Analysis
and Discovery
32. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
GeoMesa MapReduce and Spark Support
33. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo implements the MapReduce InputFormat interface.
GeoMesa MapReduce and Spark Support
34. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo implements the MapReduce InputFormat interface.
Spark provides a way to change InputFormats into RDDs.
GeoMesa MapReduce and Spark Support
35. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo implements the MapReduce InputFormat interface.
Spark provides a way to change InputFormats into RDDs.
So with a little glue code and Spark classpath/environment
management, GeoMesa has Spark support!
GeoMesa MapReduce and Spark Support
36. GeoMesa Spark Example 1: Time Series
Step 1: Get an RDD[SimpleFeature]
Step 2: Calculate the time series
Step 3: Plot the time series in R.
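Step 2 can be shown with a plain-Python stand-in: bucket records by day and count. (With GeoMesa + Spark the same shape is a map to `(day, 1)` followed by a `reduceByKey` over the RDD of SimpleFeatures; the record shape here is invented.)

```python
# Minimal time-series calculation: day -> record count.
from collections import Counter
from datetime import datetime

records = [
    {"dtg": datetime(2016, 10, 11, 10, 15)},
    {"dtg": datetime(2016, 10, 11, 18, 0)},
    {"dtg": datetime(2016, 10, 12, 9, 30)},
]
series = Counter(r["dtg"].date() for r in records)
```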
37. Using one dataset (country boundaries) to group another (here, GDELT) is
effectively a join.
Our summer intern, Atallah, worked out the details of doing this analysis in Spark
and created a tutorial and blog post.
This picture shows the ‘stability’ of a region, derived from GDELT Goldstein values
GeoMesa Spark Example 2: Aggregating by Regions
http://www.ccri.com/2016/08/17/new-geomesa-tutorial-aggregating-visualizing-data/
http://www.geomesa.org/documentation/tutorials/shallow-join.html
38. GeoMesa Spark Example 3: Aggregating Tweets about #traffic
Virginia Polygon CQL
GeoMesa RDD
Aggregate by County
Calculate ratio of #traffic
Store back to GeoMesa
39. GeoMesa Spark Example 3: Aggregating Tweets about #traffic
#traffic by Virginia county
Darker blue has a higher count
40. Problem: Another developer came by and mentioned that his Spark job using
GeoMesa had quite a few tasks (far more than expected).
Around the same time, Eugene Cheipesh (Azavea / GeoTrellis) wrote in to the
Accumulo user list…
In Accumulo 1.6.x, each range in the Accumulo InputFormat becomes a Split.
With space filling curves, it is easy to enumerate plenty of ranges for a query.
Solution: In the short term, we created a custom InputFormat whose
Splits each contain more than one range.
A small bump in the road…
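The workaround can be sketched as simple range grouping: instead of one Split per curve range, group the ranges into a bounded number of multi-range Splits so MapReduce/Spark creates far fewer tasks. (The grouping below ignores tablet locality, which the real custom InputFormat would need to account for.)

```python
# Group many scan ranges into at most n_splits multi-range splits.
def group_ranges(ranges, n_splits):
    per_split = max(1, -(-len(ranges) // n_splits))   # ceiling division
    return [ranges[i:i + per_split]
            for i in range(0, len(ranges), per_split)]
```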
41. Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
42. Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
There are two big things to work out:
1. Getting the right libraries on the
classpath.
2. Wiring up visualizations.
43. Interactive Data Discovery at Scale in GeoMesa Notebooks
GeoMesa Notebook Roadmap:
● Improved JavaScript integration
● D3.js and other visualization
libraries
● OpenLayers and Leaflet
● Python Bindings
44. Questions?
Find out more at http://geomesa.org
Connect with us on Gitter:
https://gitter.im/locationtech/geomesa
See applications at CCRi’s blog:
http://www.ccri.com/blog/
47. GeoMesa Converter Library
The Converter library is used in
1. The GeoMesa command line tools
2. GeoMesa’s NiFi processors
Configurations support XML, CSV, TSV, JSON, Avro, and more!
Examples are available for GeoNames, GDELT, OSM-GPX, Twitter, and others.
48. Live view with the GeoMesa Kafka DataStore
Q: How did you get billions of points?
A: Data is streaming in continually.
Examples come from IoT related
applications:
10 thousand sensors reporting
every 5 seconds generate 1.2 billion
records in a week.
In these cases, we want to see where
things are right now.
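The slide's arithmetic checks out:

```python
# 10,000 sensors reporting every 5 seconds for one week.
sensors = 10_000
reports_per_sensor_per_week = (7 * 24 * 3600) // 5   # 120,960
total = sensors * reports_per_sensor_per_week        # ~1.2 billion records
```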
49. GeoMesa Kafka DataStore Architecture
We have two issues to address:
1. In-memory index of
SimpleFeatures
2. Durable message passing system
For indexing, we use a combination of
Guava and CQEngine (efficient Java
collections).
Kafka serves as the message passing
system.
Consumer KDSes can be run in Storm
(for event processing), GeoServer (OGC
access), etc.
50. Around 100 years ago, mathematicians asked the question,
“Is there a continuous function from the unit interval to the unit square
which covers it?”
(Curve examples pictured: Row-Major, Z-Order, Hilbert)
Space Filling Curves: The Math