A presentation by Zohar Elkayam, Solutions Architect at Aerospike Israel, given at the IronSource meetup in Israel (December 2019).
The topic of this talk was Nested CDT (list and map) improvements in Aerospike versions 4.6 and 4.7.
It included an explanation of data modeling and code samples of such an implementation.
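As a flavor of what such a nested CDT operation looks like in code, here is a minimal sketch using the Aerospike Python client; it is not taken from the slides, and the helper names (aerospike_helpers.cdt_ctx, map_operations.map_put / map_increment), the bin layout, and the key are assumptions based on the client's documented API for server 4.6+.

```python
# A hedged sketch (not from the talk): operate on a map nested inside the "segments" bin.
import aerospike
from aerospike_helpers import cdt_ctx
from aerospike_helpers.operations import map_operations

config = {"hosts": [("127.0.0.1", 3000)]}
client = aerospike.client(config).connect()

key = ("test", "users", "user-1234")  # hypothetical namespace/set/user key

# The bin "segments" is assumed to hold a map of campaign-id -> {counter, last_seen}.
# The ctx list navigates into the nested map so the server mutates it in place.
ctx = [cdt_ctx.cdt_ctx_map_key("campaign-42")]
ops = [
    map_operations.map_increment("segments", "counter", 1, ctx=ctx),
    map_operations.map_put("segments", "last_seen", 1575936000, ctx=ctx),
]
_, _, bins = client.operate(key, ops)
client.close()
```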
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
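To make the data-profiling abstract above a little more concrete, here is a toy-scale illustration of one pattern it mentions: checking whether one column functionally determines another. The column names and rows are invented, and real profilers (including Calcite's) use far more sophisticated algorithms at scale.

```python
# Toy functional-dependency check: does column a determine column b (a -> b)?
from collections import defaultdict

rows = [
    {"zip": "94105", "city": "San Francisco", "state": "CA"},
    {"zip": "94110", "city": "San Francisco", "state": "CA"},
    {"zip": "10001", "city": "New York", "state": "NY"},
]

def determines(rows, a, b):
    # a -> b holds if no value of a maps to two different values of b.
    seen = defaultdict(set)
    for r in rows:
        seen[r[a]].add(r[b])
    return all(len(vals) == 1 for vals in seen.values())

print(determines(rows, "zip", "city"))   # True on this sample
print(determines(rows, "city", "zip"))   # False: one city has two zips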
A talk given by Julian Hyde at ApacheCon NA 2018 in Montreal on September 26th, 2018.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
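For readers unfamiliar with space-filling curves, the sketch below (not Calcite code) uses a Z-order (Morton) key, a simpler relative of the Hilbert curve mentioned above, to show how two-dimensional coordinates can be folded into one sortable key so that a key-sorted store such as HBase can answer spatial queries with ordinary range scans.

```python
# Z-order key: interleave the bits of x and y into a single integer.
def morton_key(x: int, y: int, bits: int = 16) -> int:
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Points that are close in 2-D space tend to be close in key order,
# so a bounding-box query becomes a small number of key-range scans.
points = [(3, 5), (3, 6), (200, 7)]
for px, py in sorted(points, key=lambda p: morton_key(*p)):
    print(px, py, format(morton_key(px, py), "b"))
```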
This document discusses using MapReduce to find the top K records in a distributed dataset based on specific criteria. It begins by explaining MapReduce and its limitations. It then describes finding the top K records on a single machine by sorting the data and selecting the top K. For MapReduce, each mapper finds the top K records within its split and sends them to the reducer. The reducer finds the global top K by sorting all records and selecting the top K overall. An example algorithm and sample data are provided to demonstrate how to implement a MapReduce job to solve this problem.
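The top-K pattern described above can be sketched without any framework at all; the value of K, the scoring field, and the sample records below are invented for illustration.

```python
# Framework-free sketch of the MapReduce top-K pattern.
import heapq

K = 3
splits = [
    [("a", 10), ("b", 42), ("c", 7)],
    [("d", 99), ("e", 1)],
    [("f", 55), ("g", 41), ("h", 60)],
]

def mapper(split):
    # Each mapper keeps only its local top K, so little data crosses the network.
    return heapq.nlargest(K, split, key=lambda rec: rec[1])

def reducer(partial_results):
    # The single reducer merges the per-split candidates into the global top K.
    merged = [rec for part in partial_results for rec in part]
    return heapq.nlargest(K, merged, key=lambda rec: rec[1])

print(reducer([mapper(s) for s in splits]))  # [('d', 99), ('h', 60), ('f', 55)]
```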
Processing big data quickly and efficiently with Apache Nemo
- Won Wook Song, Youngseok Yang (Software Platform Lab, Department of Computer Science and Engineering, Seoul National University)
Overview
Apache Nemo is a system that optimizes how big data applications are executed in a distributed manner, adapting to diverse resource environments and data characteristics. For geo-distributed resources, transient resources, large data shuffles, and skewed data, Apache Nemo shows significantly higher performance than Apache Spark.
Table of Contents
Optimization case studies with Apache Nemo
Apache Nemo's distributed execution process
Future research directions
The document discusses providing easy access to HDF data via NCL, IDL, and MATLAB. It presents examples and code snippets for reading HDF data from various NASA data centers like GES DISC, MODAPS, NSIDC, and LP-DAAC into the three software packages. Common issues when working with HDF files like HDF-EOS2 swaths with dimension maps and different ways metadata is stored are also addressed. The overall goal is to help lower the learning curve for users who want to analyze HDF data in their favorite analysis packages.
This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogeneous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
Schema Design by Chad Tindel, Solution Architect, 10gen (MongoDB)
MongoDB’s basic unit of storage is a document. Documents can represent rich, schema-free data structures, meaning that we have several viable alternatives to the normalized, relational model. In this talk, we’ll discuss the tradeoffs of various data modeling strategies in MongoDB using a library as a sample application. You will learn how to work with documents, evolve your schema, and apply common schema design patterns.
This document discusses heaps and their use in implementing priority queues. It describes how a max-heap or min-heap is a complete binary tree that satisfies the heap property: each internal node is greater than or equal to its children in a max-heap, or less than or equal to them in a min-heap. It explains how a heap can be represented using a simple array and how to build a heap from an unsorted array in O(n) time by sifting nodes down. Deleting the root element and maintaining the heap property takes O(log n) time. Heap sort uses a heap to sort an array in O(n log n) time. Priority queues can be efficiently implemented using max-heaps.
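A compact sketch of the operations summarized above (array representation, O(n) build via sift-down, O(log n) removal of the maximum); it is an illustrative implementation, not taken from the document itself.

```python
# Max-heap stored in a plain list: children of index i live at 2i+1 and 2i+2.
def sift_down(a, i, n):
    """Push a[i] down until the subtree rooted at i satisfies the max-heap property."""
    while True:
        left, right, largest = 2 * i + 1, 2 * i + 2, i
        if left < n and a[left] > a[largest]:
            largest = left
        if right < n and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest

def build_heap(a):
    # Sifting down from the last internal node costs O(n) overall.
    for i in range(len(a) // 2 - 1, -1, -1):
        sift_down(a, i, len(a))

def pop_max(a):
    # Swap the root with the last element, shrink, and restore the heap in O(log n).
    a[0], a[-1] = a[-1], a[0]
    top = a.pop()
    sift_down(a, 0, len(a))
    return top

data = [4, 1, 9, 7, 3, 8]
build_heap(data)
print(pop_max(data), pop_max(data))  # 9 8
```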
EDF2012 Kostas Tzoumas - Linking and analyzing big data - Stratosphere (European Data Forum)
Stratosphere is a collaborative research project between universities to build an open-source platform for big data analytics. It bridges relational databases and MapReduce using a functional programming language called Meteor. The platform includes data pools, tools for data linkage and analysis, and a scalable execution engine called Nephele. Stratosphere is optimized for parallelism using its PACT programming model and optimizer. Ongoing work focuses on UDFs, caching, and advancing the MapReduce paradigm.
Taming the Tiger: Tips and Tricks for Using Telegraf (InfluxData)
As part of InfluxDays North America 2020 Virtual Experience, the Technical Services team will be offering a free live InfluxDB training to the first 100 registered attendees. This will be hosted over Zoom and Slack with two main trainers, and there will be assistants to help participants with the course work. The training will be recorded and made available on the InfluxDays website and the InfluxData YouTube channel.
The course provides an introduction to using Telegraf within a hands-on lab setting. Attendees will be presented a series of lab exercises and get the chance to work through them with the assistance of our remote proctors. After taking this class, attendees will be able to:
Articulate the purposes and value of Telegraf
Understand the basics of configuring and running Telegraf
Understand how to manipulate incoming data to optimize InfluxDB schema
Visualize the insertion results using InfluxDB Cloud UI
Processing Geospatial at Scale at LocationTech (Rob Emanuele)
This document discusses processing large geospatial data at scale. It provides background on geospatial concepts like raster and vector data. It then discusses big data frameworks like Hadoop, Spark, and Accumulo that can be used to process geospatial data in parallel across large clusters. Finally, it presents several LocationTech projects like GeoTrellis, GeoJinni, and GeoWave that build geospatial capabilities on top of these frameworks to allow distributed processing and querying of large raster and vector maps.
The MathWorks introduced MATLAB support for HDF5 in 2002 via three high-level functions: HDF5INFO, HDF5READ, and HDF5WRITE. These functions worked well for their purpose, providing simple interfaces to a complicated file format, but MATLAB users requested finer control over their HDF5 files and the HDF5 library. MATLAB 7.3 (R2006b) adds this precise level of support for version 1.6.5 of the HDF5 library via a close mapping of the HDF5 C API to MATLAB function calls.
This presentation will briefly introduce the earlier, high-level HDF5 interface (and its limitations) before showing in detail the low-level HDF5 functions. It will show how to interact with the HDF5 library and files using the thirteen classes of functions in MATLAB, which encapsulate groupings of functionality found in the HDF5 C API. But because MATLAB is itself a higher-level language than C, we will also present MATLAB's extensions and modifications of the HDF5 C API that make it more MATLAB-like, work with defined values, and perform ID and memory management.
Wrapping a library like HDF5 requires a great deal of effort and design, and we will briefly present a general-purpose mechanism for creating close mappings between library interfaces and an application like MATLAB. One of our goals in this presentation is to facilitate communication with The HDF Group about how The MathWorks builds our HDF5 interfaces in order to ease adoption of future versions of the HDF5 library in large, general-purpose applications.
This document discusses Hadoop and MapReduce. It describes how Hadoop uses MapReduce and how it was inspired by Google's implementation. It provides details on the key components of Hadoop including HDFS, JobTracker, TaskTracker, NameNode and DataNode. It also provides examples of using Hadoop with different programming languages like Java, Python and C/C++ and discusses tuning Hadoop performance.
Processing Geospatial Data At Scale @locationtech (Rob Emanuele)
This document discusses processing large geospatial data at scale. It provides background on big data frameworks like Apache Hadoop, Apache Spark, and geospatial projects like GeoTrellis, GeoWave, and SpatialHadoop that enable processing geospatial data using these frameworks. The document outlines how these tools allow geospatial data from sources like satellite imagery, OpenStreetMap, and geotagged social media to be analyzed using distributed computing platforms and algorithms.
This is a slide deck that I have been using to present on GeoTrellis for various meetings and workshops. The information reflects GeoTrellis pre-1.0, as of Q4 2016.
Well-designed tables, using techniques like partitioning and bucketing, can improve query speed and reduce costs. Partitioning involves horizontally slicing data, such as by date or location. Bucketing imposes structure that allows more efficient queries, sampling, and map-side joins. Parallel query execution allows subqueries to run simultaneously to improve performance. The explain command helps analyze queries and identify optimizations.
Anais Dotis-Georgiou & Faith Chikwekwe [InfluxData] | Top 10 Hurdles for Flux...InfluxData
This document discusses top hurdles for Flux beginners and provides solutions. It covers 10 hurdles: 1) Overlooking UI tools for writing Flux, 2) Misunderstanding annotated CSVs, 3) Data layout design leading to high cardinality, 4) Storing wrong data in InfluxDB, 5) Confusion about projecting multiple aggregations, 6) Unfamiliarity with common Flux packages and functions, 7) Unawareness of performance optimizations, 8) Improper use of schema mutations, 9) Knowing when to write custom tasks, and 10) Not understanding when and why to downsample data. For each hurdle, it provides explanations and code examples to illustrate solutions.
Enabling Access to Big Geospatial Data with LocationTech and Apache projects (Rob Emanuele)
LocationPowers OGC BigGeoData 2016
This presentation will discuss tools in the open source landscape that are used to handle big geospatial data. In particular, we will focus on how Apache frameworks such as Spark and Accumulo are "geospatially enabled" by four projects: GeoTrellis, GeoWave, GeoMesa, and GeoJinni. These four projects all participate in LocationTech, a working group under the Eclipse Foundation. Specifically, we will discuss how each of these LocationTech technologies implements spatial indexing (e.g. by using space filling curves) in order to provide quick access to data, and other common themes among the four projects. Attendees should walk away from this presentation understanding important parts of the Apache big data ecosystem, a set of LocationTech projects that belong to the cutting edge of enabling those Apache projects' handling of geospatial data, as well as some solutions to common problems when dealing with large geospatial data.
ComputeFest 2012: Intro To R for Physical Sciences (alexstorer)
This document provides an introduction to the R programming language presented by Alex Storer at ComputeFest 2012. It discusses why R should be used over other languages like MATLAB and Python, provides examples of basic R syntax and functions, and walks through an example of loading climate data and creating plots to visualize rainfall anomalies over time. The goal is to provide attendees with a foundation of R basics while working through a real data analysis problem.
1. Hadoop is a framework for distributed processing of large datasets across clusters of computers.
2. Hadoop can be used to perform tasks like large-scale sorting and data analysis faster than with traditional databases like MySQL.
3. Example applications of Hadoop include processing web server logs, managing user profiles for a large website, and performing machine learning on massive datasets.
Latent Semantic Analysis of Wikipedia with Spark (Sandy Ryza)
This document describes how to perform latent semantic analysis (LSA) on Wikipedia data using Apache Spark. It discusses parsing Wikipedia XML data, creating a term-document matrix, applying singular value decomposition to reduce the matrix's rank, and interpreting the results to find concepts and related documents. Key steps include parsing Wikipedia pages into terms and documents, cleaning the data through lemmatization and removing stop words, creating a tf-idf weighted term-document matrix, applying SVD to get U, S, and V matrices, and using these to find terms and documents most strongly related to given queries.
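As a hedged, single-machine illustration of the same pipeline (tf-idf weighting followed by a truncated SVD), the snippet below uses scikit-learn and NumPy on three toy documents rather than the Spark code the document refers to.

```python
# Toy LSA: tf-idf matrix, truncated SVD, and projection into a low-rank "concept" space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "spark makes distributed data processing simple",
    "wikipedia articles cover many topics",
    "singular value decomposition reduces matrix rank",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs).toarray()   # documents x terms, tf-idf weighted

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                          # keep the top k singular values
doc_concepts = U[:, :k] * S[:k]                # documents projected onto concepts
term_concepts = Vt[:k, :].T                    # terms projected onto concepts

# Documents (or terms) with similar concept vectors are related,
# even when they share few literal words.
print(doc_concepts.round(3))
```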
This webinar will give an overview of CREATE STATISTICS in PostgreSQL. This command allows the database to collect multi-column statistics, helping the optimizer understand dependencies between columns, produce more accurate estimates, and build better query plans.
The following key topics will be covered during the webinar:
- Why CREATE STATISTICS may be needed at all
- How the command works
- Which cases CREATE STATISTICS already addresses
- What improvements are in the queue for future PostgreSQL versions (either already committed to PostgreSQL 13 or beyond)
Richard Cole and Adam Gray from AWS hosted an Elastic MapReduce Office Hours session on April 13th, 2011 to discuss new features and answer questions. They demonstrated how to resize a running job flow and launch a Hive-based data warehouse to analyze contextual advertising data from impressions and clicks tables stored in S3. Office Hours provides a forum for technical discussions with AWS experts but is not intended for support or information about unreleased services.
Best Practices for Migrating Legacy Data Warehouses into Amazon Redshift (Amazon Web Services)
The document summarizes best practices for migrating legacy data warehouses to Amazon Redshift. It covers architectural concepts like columnar storage and compression, data distribution styles, sort keys to optimize query performance, and materializing dimension columns in fact tables. The presentation provides an overview of these topics and their impact on storage, I/O and querying. Real-world examples are also given to illustrate key points.
Bridging Structured and Unstructured Data with Apache Hadoop and Vertica (Steve Watt)
This document discusses bridging unstructured and structured data with Hadoop and Vertica. It describes using Hadoop to extract and structure unstructured investment data from the web. Then it uses Pig to add zip code data and store the results in Vertica. Finally, it explains how Vertica can be used for reporting and data visualization of the structured data for analysis.
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges... (Altinity Ltd)
This document discusses using ClickHouse to manage log data. It begins with an introduction to ClickHouse and its features. It then covers different ways to model log data in ClickHouse, including storing logs as JSON blobs or converting them to a tabular format. The document demonstrates using materialized views to ingest logs into ClickHouse tables in an efficient manner, extracting values from JSON and converting to columns. It shows how this approach allows flexible querying of log data while scaling to large volumes.
ClickHouse materialized views - a secret weapon for high performance analytic... (Altinity Ltd)
ClickHouse materialized views allow you to precompute aggregates and transform data to improve query performance. Materialized views can store precomputed aggregates from a source table to speed up aggregation queries over 100x. They can also retrieve the last data point for each item over 100x faster than scanning the raw data table. Materialized views provide a way to optimize data storage layout and indexing to improve query efficiency.
Aerospike User Group: Exploring Data Modeling (Brillix)
Israeli Aerospike User Group meetup #1: Exploring Data Modeling - best practices with Aerospike data types - map, list and others by Ronen Botzer on July 10, 2018
1) The document provides instructions for setting up an AWS account and launching an EC2 instance with an AMI that contains tools and documentation for a hands-on tutorial on NoSQL databases and MongoDB.
2) The tutorial covers basic MongoDB commands and demonstrates how to create, insert, update, and query document data using the mongo shell client. Embedded and nested documents are explored along with geospatial queries.
3) A map-reduce example aggregates historical check-in data to calculate popular locations over different time periods, demonstrating how MongoDB supports batch operations.
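As a rough illustration of the kind of aggregation the tutorial describes, the snippet below counts check-ins per location with PyMongo's aggregation pipeline; the collection name and fields are invented, and the pipeline is a modern stand-in for the map-reduce job mentioned above rather than a copy of it.

```python
# Hypothetical collection "checkins" with one document per check-in:
#   {"user": "u1", "location": "cafe-12", "ts": ...}
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
checkins = client.tutorial.checkins

pipeline = [
    {"$group": {"_id": "$location", "count": {"$sum": 1}}},  # check-ins per location
    {"$sort": {"count": -1}},                                # most popular first
    {"$limit": 5},
]
for doc in checkins.aggregate(pipeline):
    print(doc["_id"], doc["count"])
```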
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Speakers:
Karan Desai - Solutions Architect, AWS
Neel Mitra - Solutions Architect, AWS
Presentation from DICE Coder's Day (2010 November) by Andreas Fredriksson in the Frostbite team.
Goes into detail about Scope Stacks, which are a systems programming tool for memory layout that provides
• Deterministic memory map behavior
• Single-cycle allocation speed
• Regular C++ object life cycle for objects that need it
This makes it very suitable for games.
Real-Time Spark: From Interactive Queries to Streaming (Databricks)
This document summarizes Michael Armbrust's presentation on real-time Spark. It discusses:
1. The goals of real-time analytics, including having the freshest answers as fast as possible while keeping those answers up to date.
2. Spark 2.0 introduces unified APIs for SQL, DataFrames and Datasets to make developing real-time analytics simpler with powerful yet simple APIs.
3. Structured streaming allows running the same SQL queries on streaming data to continuously aggregate data and update outputs, unifying batch, interactive, and streaming queries into a single API.
The document provides an agenda for a seasoned developers track workshop. The agenda includes sessions on InfluxDB query language (IFQL), writing Telegraf plugins, using InfluxDB for open tracing, advanced Kapacitor techniques, setting up InfluxData for IoT, and database orchestration. There will also be breakfast, lunch, breaks and pizza/beer.
Apache Drill 1.0 has been released after nearly three years of development involving 45 code contributors and countless other contributors. Drill provides a SQL interface for analyzing both structured and unstructured data across numerous data sources. It aims to execute queries fast by leveraging columnar encodings and scaling out queries rather than scaling up. Drill also aims to support iterative exploration and querying of data without requiring data preparation. Future plans for Drill include continued monthly releases, integration with other technologies like JDBC and Cassandra, and tools to deploy Drill on EMR and EC2.
beyond tellerrand: Mobile Apps with JavaScript – There's More Than Web (Heiko Behrens)
abstract from http://2011.beyondtellerrand.com
Modern web technologies and responsive design aim at a platform independent code base while promising first-class experience on any mobile device. Even though purely web-based approaches can achieve stunning results, they (still) cannot compete with their native counterparts regarding platform features and integration.
In this talk, I will show you how we can use JavaScript to produce mobile apps that include features such as native UI, push notifications, sensors, and paid distribution. You can expect lots of live demos when I will compare the strengths and weaknesses of various frameworks.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg... (Chester Chen)
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It’s designed to ingest billions of Kafka messages every 30 minutes. The amount of data handled by the pipeline is of the order of hundreds of TBs. Omkar details how to tackle such scale and gives insights into the optimization techniques. Some key highlights are how to understand bottlenecks in Spark applications, whether to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and non-heap memory usage across hundreds of executors, how you can change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, and how to amortize the cost of your application by multiplexing your jobs, along with different techniques for reducing memory footprint, runtime, and on-disk usage. CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
A Century Of Weather Data - Midwest.io (Randall Hunt)
This document summarizes the key considerations and performance tests for storing and querying a large weather dataset containing over 2.5 billion data points. It describes the schema design using MongoDB to embed data and index on location. Bulk loading of the data took 10 hours on a single server but only 3 hours on a sharded cluster. Queries for a single data point were fastest on the cluster, returning in under 1 ms, while worldwide queries ran at 310 per second. Analytics like maximum temperature took 2.5 hours on a single server but only 2 minutes on the cluster. The cluster provided much higher throughput and better performance for complex queries while being more expensive.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering, but it also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Keeping Spark on Track: Productionizing Spark for ETL (Databricks)
ETL is the first phase when building a big data processing platform. Data is available from various sources and formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors.
Speakers: Kyle Pistor & Miklos Christine
This talk was originally presented at Spark Summit East 2017.
Similar to Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Meetup - Introduction - Ami - 04 March 2020 (Aerospike)
Introduction to Aerospike NoSQL Database. Session was delivered at "Big Data, Max Speed @ Minimal Cost" Meetup at Nielsen R&D Center in Tel Aviv, March 4, 2020.
Aerospike Meetup - Real Time Insights using Spark with Aerospike - Zohar - 04... (Aerospike)
How to leverage Spark with Aerospike NoSQL Database to get real time insights. Session was delivered at "Big Data, Max Speed @ Minimal Cost" Meetup at Nielsen R&D Center in Tel Aviv, March 4, 2020.
Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020 (Aerospike)
Aerospike at Nielsen customer story. Session was delivered at "Big Data, Max Speed @ Minimal Cost" Meetup at Nielsen R&D Center in Tel Aviv, March 4, 2020.
Aerospike Roadmap Overview - Meetup Dec 2019 (Aerospike)
The document provides a summary of updates to Aerospike Enterprise Edition from May 2019 to December 2019, as well as planned updates through 2020. Key updates include adding compression, supporting nested data types and bitwise operations, improving scan and query performance, and adding capabilities like pagination. Planned updates focus on enhancing cross data center replication, secondary indexes, and the client-server protocol.
Aerospike Data Modeling - Meetup Dec 2019 (Aerospike)
This is a presentation done by Ronen Botzer, the Director of Product at Aerospike as part of the IronSource meetup in Israel (December 2019).
In this talk, Ronen explained how to use nested CDTs and Bitwise operations in order to manage user segmentation and to create a proper data model.
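As a loose, client-side illustration of the bitmap idea behind such bitwise segmentation (not the Aerospike blob/bit operations shown in the talk), each user can be modeled as a fixed-size byte array with one bit per segment; the segment numbers below are invented.

```python
# One bit per segment: setting and testing membership with plain byte arithmetic.
def set_segment(bitmap: bytearray, segment_id: int) -> None:
    bitmap[segment_id // 8] |= 1 << (segment_id % 8)

def has_segment(bitmap: bytearray, segment_id: int) -> bool:
    return bool(bitmap[segment_id // 8] & (1 << (segment_id % 8)))

user = bytearray(16)          # room for 128 segments per user
set_segment(user, 3)
set_segment(user, 42)
print(has_segment(user, 42), has_segment(user, 7))  # True False
```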
JDBC Driver for Aerospike - Meetup Dec 2019 (Aerospike)
The document describes a JDBC driver that allows SQL queries to be run against Aerospike databases. The driver provides SQL compliance by mapping SQL statements to the appropriate Aerospike operations. It supports statements like SELECT, INSERT, UPDATE, DELETE as well as functions, aggregation, JOINs and more. Future plans include improving performance, adding support for additional data types and operations, and deploying to a public repository. The goal is to provide a standard SQL interface for integrating Aerospike with various SQL-based tools and applications.
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the presentation for a speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency (ScyllaDB)
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
HCL Notes and Domino license cost reduction in the world of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help you do it!
We explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove redundant or unused accounts to save money. There are also some approaches that can lead to unnecessary spending, for example when a person document is used instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered
- Reducing license costs by finding and fixing misconfigurations and redundant accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices you can apply immediately
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Dandelion Hashtable: beyond billion requests per second on a commodity server (Antonios Katsarakis)
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
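Not DLHT itself, but a tiny closed-addressing (chained) hashtable in Python to make the bucket-and-chain vocabulary above concrete; DLHT's actual contributions (lock-free operations, bounded cache-line chains, prefetching, non-blocking parallel resizes) are deliberately not reproduced here.

```python
# Closed addressing: each bucket holds a small chain of (key, value) pairs.
class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # update in place
                return
        chain.append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # Unlike open addressing, a delete frees the slot immediately.
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]

t = ChainedHashTable()
t.put("a", 1)
t.put("b", 2)
t.delete("a")
print(t.get("a"), t.get("b"))  # None 2
```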
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
FREE A4 Cyber Security Awareness Posters - Social Engineering part 3 (Data Hops)
Free A4 downloadable and printable cyber security and social engineering safety and security training posters. Promote security awareness in the home or workplace. Lock them out. From training providers datahops.com.