MapR is a distributed filesystem and data platform modeled after Hadoop. It maintains API compatibility with Hadoop while exceeding it in performance, manageability, and other respects.
Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark.
Several other libraries, such as Vaex and Modin, also try to scale the pandas API. Dask is one of them and is very popular among pandas users; it runs on its own cluster, much as Koalas runs on top of a Spark cluster. In this talk, we will introduce Koalas and its current status, and compare Koalas with Dask, including benchmarking.
Accelerating Data Processing in Spark SQL with Pandas UDFs (Databricks)
Spark SQL provides a convenient layer of abstraction for users to express their query’s intent while letting Spark handle the more difficult task of query optimization. Since Spark 2.3, the addition of pandas UDFs has allowed users to define arbitrary functions in Python that execute on batches of data, giving them the flexibility required to write queries that suit very niche cases.
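As a hedged illustration of the Spark 2.3-era API, the sketch below defines a scalar pandas UDF that operates on whole batches (pandas Series) at a time; the column names and SparkSession setup are assumptions, not taken from the talk.

```python
# A minimal sketch of a scalar pandas UDF (Spark 2.3+ style); column names
# and data are illustrative assumptions.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 7.1)], ["id", "price"])

@pandas_udf("double", PandasUDFType.SCALAR)
def add_tax(price):
    # Executed on whole pandas Series batches rather than row by row.
    return price * 1.08

df.select("id", add_tax("price").alias("price_with_tax")).show()
```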
Hyperspace: An Indexing Subsystem for Apache Spark (Databricks)
At Microsoft, we store datasets (both from internal teams and external customers) ranging from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to exploratory, ‘finding a needle in a haystack’ queries (e.g., point lookups, summarization).
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
Koalas is an open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data.
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first step data scientists take to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.
When data scientists work with very large data sets today, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas DataFrames (a minimal sketch follows the list below).
Through live demonstrations and code samples, you will understand:
- how to effectively leverage both pandas and Spark inside the same code base
- how to leverage powerful pandas concepts such as lightweight indexing with Spark
- technical considerations for unifying the different behaviors of Spark and pandas
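As a minimal sketch of that conversion (column names, data, and the Arrow configuration key are illustrative assumptions), moving between the two DataFrame types looks roughly like this:

```python
# A minimal sketch of moving between Spark and pandas DataFrames.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-pandas-interop").getOrCreate()

# Start with a pandas DataFrame on the driver ...
pdf = pd.DataFrame({"user": ["a", "b", "c"], "clicks": [10, 3, 7]})

# ... distribute it as a Spark DataFrame for large-scale transformations ...
sdf = spark.createDataFrame(pdf)
top = sdf.filter(sdf.clicks > 5)

# ... and collect the (now small) result back to pandas for local analysis.
# Enabling Arrow speeds up the conversion; the key is
# spark.sql.execution.arrow.pyspark.enabled on Spark 3.x.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result_pdf = top.toPandas()
print(result_pdf)
```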
R is a very popular platform for data science. Apache Spark is a highly scalable data platform. How can we have the best of both worlds? How can a data scientist leverage the rich 10,000+ packages on CRAN and integrate Spark into their existing data science toolset?
SparkR is a new language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable scalable machine learning on big data. In addition to covering the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, and give examples of how that can be used to work with your favorite R packages. We will also discuss best practices around using these new features, and look at exciting changes already in, and coming next in, the Apache Spark 2.x releases.
Getting Ready to Use Redis with Apache Spark with Dvir Volk (Spark Summit)
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in... (DataWorks Summit)
Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets, and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0 we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy/unstructured files, a structured/columnar historical data warehouse, or arriving in real time from pub/sub systems like Kafka and Kinesis.
We'll walk through a concrete example where, in less than 10 lines, we read from Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. We'll use techniques including event-time-based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
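A hedged sketch of such a pipeline in PySpark is shown below; the broker address, topic name, and JSON schema are assumptions, and running it requires the spark-sql-kafka connector package on the classpath.

```python
# A minimal sketch: read Kafka, parse a JSON payload, aggregate by event time
# with a watermark, and write the result out continuously.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

schema = (StructType()
          .add("device", StringType())
          .add("ts", TimestampType())
          .add("reading", DoubleType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed address
          .option("subscribe", "events")                      # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Event-time aggregation; the watermark bounds state kept for late data.
counts = (events
          .withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("device"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")   # in practice: a table or sink for ad-hoc queries
         .start())
```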
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, the decisions we made, and other options for integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times, and I am sure others will benefit from the gotchas we were able to identify.
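One common pattern for this kind of integration, sketched here under assumptions (the shared-library name libsentiment.so and its score_text signature are hypothetical, not Bloomberg's actual API), is to load the compiled library with ctypes inside mapPartitions so it is loaded once per partition:

```python
# A hedged sketch of calling existing C++ code from PySpark via ctypes.
import ctypes
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cpp-from-pyspark").getOrCreate()
sc = spark.sparkContext
# Ship the shared object to every executor (path is an assumption).
sc.addFile("/opt/models/libsentiment.so")

def score_partition(rows):
    from pyspark import SparkFiles
    lib = ctypes.CDLL(SparkFiles.get("libsentiment.so"))
    lib.score_text.argtypes = [ctypes.c_char_p]   # hypothetical C API
    lib.score_text.restype = ctypes.c_double
    for text in rows:
        yield (text, lib.score_text(text.encode("utf-8")))

texts = sc.parallelize(["markets rallied today", "shares fell on weak earnings"])
print(texts.mapPartitions(score_partition).collect())
```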
In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas’ new functionalities since its initial release, discuss its roadmap, and explain how we think Koalas could become the standard API for large-scale data science.
What you will learn:
How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI (a short usage sketch follows this list)
Pre-register for Databricks Community Edition
Read koalas docs
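A minimal getting-started sketch, assuming Koalas has been installed from PyPI as above (the column names and data are illustrative):

```python
# Koalas exposes the pandas API, but operations run as Spark jobs.
import databricks.koalas as ks

# Create a Koalas DataFrame exactly as you would a pandas one.
kdf = ks.DataFrame({"city": ["SEA", "SFO", "SEA"], "temp": [61, 68, 59]})

# Familiar pandas-style operations are executed on the Spark cluster.
print(kdf.groupby("city")["temp"].mean())

# Interoperate with pandas and Spark when needed.
pdf = kdf.to_pandas()   # collect to a local pandas DataFrame
sdf = kdf.to_spark()    # get the underlying Spark DataFrame
```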
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service (Databricks)
Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in the industry, which leads to many issues such as hardware failures (burned-out disks) and reliability and scalability challenges.
In this hands-on tutorial we will present Koalas, a new open source project. Koalas is an open source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module that lets Apache Spark work with structured, row-based data across a majority of operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth, we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion to processing.
This document discusses Hivemall, an open source machine learning library for Apache Hive, Spark, and Pig. In brief:
Hivemall is a scalable machine learning library built as a collection of Hive UDFs that allows users to perform machine learning tasks like classification, regression, and recommendation using SQL queries. Hivemall supports many popular machine learning algorithms and can run in parallel on large datasets using Apache Spark, Hive, Pig, and other big data frameworks. The document outlines how to run a machine learning workflow with Hivemall on Spark, including loading data, building a model, and making predictions.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... (Databricks)
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overhead because all data and intermediate state of compute tasks is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs (Databricks)
This document summarizes Daniel Galvez's presentation on creating The People's Speech Dataset using Apache Spark and TPUs. The key points are:
1) The dataset aims to provide 86,000 hours of speech data with forced alignments between audio and transcripts in order to be challenging, free to use, and have a commercial license.
2) The conceptual workload is to take hour-long audio files, split them into 15 second segments, and use a pretrained speech recognition model to discover when each word in the transcript was said.
3) Creating the dataset encountered limitations with accelerator-aware scheduling in Spark, memory issues with PySpark UDFs, crashes in TPUs, and the need to reorder data by
Pandas UDF and Python Type Hint in Apache Spark 3.0 (Databricks)
In the past several years, pandas UDFs have been perhaps the most important change to Apache Spark for Python data science. However, these functionalities have evolved organically, leading to some inconsistencies and confusion among users. In Apache Spark 3.0, the pandas UDFs were redesigned by leveraging Python type hints.
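A brief sketch of the Spark 3.0 style, where the kind of pandas UDF is inferred from Python type hints instead of a PandasUDFType argument; the DataFrame contents are illustrative assumptions.

```python
# Spark 3.0-style pandas UDF: the Series -> Series type hints mark it as a
# scalar pandas UDF.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-type-hints").getOrCreate()
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "visits"])

@pandas_udf("string")
def shout(name: pd.Series) -> pd.Series:
    # Runs on whole pandas Series batches.
    return name.str.upper()

df.select(shout("name").alias("name_upper"), "visits").show()
```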
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks (DataWorks Summit)
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures and APIs directly using SQL without needing to transform or structure the data first. This allows gaining insights into issues like dropped sensor readings by analyzing packets alongside other data sources. The document concludes that SQL-on-Hadoop technologies allow network analysis to be done in a BI context more quickly than traditional specialized tools.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured, row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster. It also uses WholeStageCodeGen to improve performance through Java JIT-compiled code. However, the Java JIT usually does not make good use of the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based expression engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
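As a small illustration of the columnar-kernel idea from Python (the engine described above works at the C++/Gandiva level; this pyarrow sketch with made-up columns only shows the same concept):

```python
# Columns live in contiguous, SIMD-friendly buffers; compute kernels operate
# on whole columns at once rather than row by row.
import pyarrow as pa
import pyarrow.compute as pc

prices = pa.array([10.0, 12.5, 9.9, 14.2])
qty = pa.array([3, 1, 7, 2])

revenue = pc.multiply(prices, pc.cast(qty, pa.float64()))
print(revenue)
print(pc.sum(revenue))
```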
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas:
* *Scaling Compute*: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
* *Optimizing Core Engine*: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
* *Scaling Users*: How we make Spark easy to use, and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
This document discusses using Apache Spark and Amazon DSSTNE to generate product recommendations at scale. It summarizes that Amazon uses Spark and Zeppelin notebooks to allow data scientists to develop queries in an agile manner. Deep learning jobs are run on GPUs using Amazon ECS, while CPU jobs run on Amazon EMR. DSSTNE is optimized for large sparse neural networks and allows defining networks in a human-readable JSON format to efficiently handle Amazon's large recommendation problems.
Relationship Extraction from Unstructured Text Based on Stanford NLP with Spa... (Spark Summit)
This document discusses using Stanford NLP and Spark to extract relationships from unstructured text. It presents a pipeline for annotating entities in oil and gas supply chain text using NER, extracting relationships using pattern matching, and simplifying sentences. The pipeline is implemented using Spark for scalability and fault tolerance. Benefits of the approach include code reuse between batch and streaming layers and easy distribution of NLP processing.
Best Practices for Building Robust Data Platform with Apache Spark and Delta (Databricks)
This talk will focus on the journey of technical challenges, trade-offs, and ground-breaking achievements in building performant and scalable pipelines, drawn from experience working with our customers.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Stories About Spark, HPC and Barcelona by Jordi Torres (Spark Summit)
HPC in Barcelona is centered around the MareNostrum supercomputer and BSC's 425-person team from 40 countries. MareNostrum allows simulation and analysis in fields like life sciences, earth sciences, and engineering. To meet new demands of big data analytics, BSC developed the Spark4MN module to run Spark workloads on MareNostrum. Benchmarking showed Spark4MN achieved good speed-up and scale-out. Further work profiles Spark using BSC tools and benchmarks workloads like image analysis on different hardware. BSC's vision is to advance understanding through technologies like cognitive computing and deep learning.
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V... (MongoDB)
This document discusses analyzing flight data using MongoDB aggregation. It provides examples of aggregation pipelines using the group, match, project, sort, unwind, and other stages. It explores questions about major carriers, airport cancellations, and delays by distance and carrier. It also discusses visualizing route data and hub airports. Finally, it proposes a quiz on analyzing NYC flight data by importing data and performing queries on origins, cancellations, delays, and weather impacts by month between the three major NYC airports.
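A hedged sketch of that style of pipeline via PyMongo; the database, collection, and field names are assumptions rather than the webinar's actual dataset.

```python
# Average arrival delay per carrier for non-cancelled flights, worst first.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
flights = client["travel"]["flights"]

pipeline = [
    {"$match": {"cancelled": False}},
    {"$group": {"_id": "$carrier",
                "avgDelay": {"$avg": "$arrDelay"},
                "flights": {"$sum": 1}}},
    {"$sort": {"avgDelay": -1}},
    {"$project": {"carrier": "$_id", "avgDelay": 1, "flights": 1, "_id": 0}},
]

for doc in flights.aggregate(pipeline):
    print(doc)
```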
Creating a Modern Data Architecture for Digital Transformation (MongoDB)
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Back to Basics Webinar 3: Introduction to Replica Sets (MongoDB)
This document provides an introduction to MongoDB replica sets, which allow for data redundancy and high availability. It discusses how replica sets work, including the replica set life cycle and how applications should handle writes and queries when using a replica set. Specifically, it explains that the MongoDB driver is responsible for server discovery and monitoring, retry logic, and handling topology changes in a replica set to provide a consistent view of the data to applications.
The document discusses MongoDB’s Aggregation Framework, which allows users to perform ad-hoc queries and reshape data in MongoDB. It describes the key components of the aggregation pipeline, including the $match, $project, $group, and $sort operators. It provides examples of how to filter, reshape, and summarize document data using the aggregation framework. The document also covers the usage and limitations of aggregation, as well as how it can be used to enable more flexible data analysis and reporting compared to MapReduce.
Webinar: 10-Step Guide to Creating a Single View of your Business (MongoDB)
Organizations have long seen the value in aggregating data from multiple systems into a single, holistic, real-time representation of a business entity. That entity is often a customer. But the benefits of a single view in enhancing business visibility and operational intelligence can apply equally to other business contexts. Think products, supply chains, industrial machinery, cities, financial asset classes, and many more.
However, for many organizations, delivering a single view to the business has been elusive, impeded by a combination of technology and governance limitations.
MongoDB has been used in many single view projects across enterprises of all sizes and industries. In this session, we will share the best practices we have observed and institutionalized over the years. By attending the webinar, you will learn:
- A repeatable, 10-step methodology to successfully delivering a single view
- The required technology capabilities and tools to accelerate project delivery
- Case studies from customers who have built transformational single view applications on MongoDB.
Design, Scale and Performance of MapR's Distribution for Hadoop (mcsrivas)
Details the first-ever exabyte-scale system that can hold a trillion large files. Describes MapR's Distributed NameNode™ architecture and how it scales easily and seamlessly. Shows MapReduce performance across a variety of benchmarks like dfsio, pig-mix, nnbench, terasort, and YCSB.
Back to Basics Webinar 1: Introduction to NoSQL (MongoDB)
This document provides an overview of an introduction to NoSQL webinar. It discusses why NoSQL databases were created, the different types of NoSQL databases including key-value stores, column stores, graph stores, multi-model databases and document stores. It provides details on MongoDB, describing how MongoDB stores data as JSON-like documents with dynamic schemas and supports features like indexing, aggregation and geospatial queries. The webinar agenda is also outlined.
Webinar: Working with Graph Data in MongoDB (MongoDB)
With the release of MongoDB 3.4, the number of applications that can take advantage of MongoDB has expanded. In this session we will look at using MongoDB for representing graphs and how graph relationships can be modeled in MongoDB.
We will also look at a new aggregation operation that we recently implemented for graph traversal and computing transitive closure. We will include an overview of the new operator and provide examples of how you can exploit this new feature in your MongoDB applications.
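A hedged sketch of graph traversal with the $graphLookup stage (MongoDB 3.4+) through PyMongo; the collection and field names are illustrative assumptions, not taken from the webinar.

```python
# Follow the reportsTo field transitively to build each person's chain of managers.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
employees = client["hr"]["employees"]

pipeline = [
    {"$graphLookup": {
        "from": "employees",
        "startWith": "$reportsTo",
        "connectFromField": "reportsTo",
        "connectToField": "name",
        "as": "managementChain",
    }}
]

for doc in employees.aggregate(pipeline):
    print(doc["name"], "->", [m["name"] for m in doc["managementChain"]])
```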
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg... (MongoDB)
The United States will be deploying 16,000 traffic speed monitoring sensors - 1 on every mile of US interstate in urban centers. These sensors update the speed, weather, and pavement conditions once per minute. MongoDB will collect and aggregate live sensor data feeds from roadways around the country, support real-time queries from cars on traffic conditions on their route as well as be the platform for real-time dashboards displaying traffic conditions and more complex analytical queries used to identify traffic trends. In this session, we’ll implement a few different data aggregation techniques to query and dashboard the metrics gathered from the US interstate.
Back to Basics: My First MongoDB Application (MongoDB)
- The document is a slide deck for a webinar on building a basic blogging application using MongoDB.
- It covers MongoDB concepts like documents, collections and indexes. It then demonstrates how to install MongoDB, connect to it using the mongo shell, and insert documents.
- The slide deck proceeds to model a basic blogging application using MongoDB, creating collections for users, articles and comments. It shows how to query, update, and import large amounts of seeded data.
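A hedged sketch of those basics using PyMongo instead of the mongo shell; the database, collection, and field names are illustrative assumptions.

```python
# Insert, index, and query documents for a minimal blogging data model.
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]

# Documents have flexible, JSON-like schemas.
db.users.insert_one({"_id": "esther", "name": "Esther Example"})
db.articles.insert_one({
    "author": "esther",
    "title": "Hello MongoDB",
    "body": "First post!",
    "tags": ["intro", "mongodb"],
    "posted": datetime.now(timezone.utc),
})

# An index supports the common query pattern: newest articles by author.
db.articles.create_index([("author", 1), ("posted", DESCENDING)])

for article in db.articles.find({"author": "esther"}).sort("posted", DESCENDING):
    print(article["title"])
```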
The document discusses several scenarios for using Hadoop and MapR to address large-scale data and file management challenges. It describes how MapR improves upon Hadoop through innovations like containers, volumes, and transactional snapshots. It also provides examples of how MapR could be used to solve problems involving billions of files, global data distribution, real-time data processing, model deployment, video repositories, and backups.
Cisco Connect Toronto 2015: Big Data - Sean McKeown (Cisco Canada)
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Brad Anderson from MapR Technologies presented on technologies for interactive analysis (Apache Drill) and stream processing (Storm) beyond traditional batch processing with Hadoop/MapReduce. Drill allows interactive queries over large datasets through its columnar storage and distributed query engine. Storm is a framework for real-time computation over streaming data through topologies of processing components. M7 provides a more reliable and higher performance alternative to HBase through its unified storage and simplified architecture with no external daemons.
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications (Jason Shao)
Slides from: http://www.meetup.com/Hadoop-NYC/events/34411232/
There are a number of assumptions that come with using standard Hadoop that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. These changes in assumptions have ripple effects throughout the system architecture. This is significant because many systems like Mahout provide multiple implementations of various algorithms with very different performance and scaling implications.
I will describe several case studies and use these examples to show how these changes can simplify systems or, in some cases, make certain classes of programs run an order of magnitude faster.
About the speaker: Ted Dunning - Chief Application Architect (MapR)
Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and at MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; an MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
This document provides an overview of Apache Hadoop, a framework for storing and processing large datasets in a distributed computing environment. It discusses what big data is and the challenges of working with large datasets. Hadoop addresses these challenges through its two main components: the HDFS distributed file system, which stores data across commodity servers, and MapReduce, a programming model for processing large datasets in parallel. The document outlines the architecture and benefits of Hadoop for scalable, fault-tolerant distributed computing on big data.
Offline processing with Hadoop allows for scalable, simplified batch processing of large datasets across distributed systems. It enables increased innovation by supporting complex analytics over large data sets without strict schemas. Hadoop adoption is moving beyond legacy roles to focus on data processing and value creation through scalable and customizable systems like Cascading.
This document discusses distributed computing and Hadoop. It begins by explaining distributed computing and how it divides programs across several computers. It then introduces Hadoop, an open-source Java framework for distributed processing of large data sets across clusters of computers. Key aspects of Hadoop include its scalable distributed file system (HDFS), MapReduce programming model, and ability to reliably process petabytes of data on thousands of nodes. Common use cases and challenges of using Hadoop are also outlined.
This document provides an introduction to Hadoop and big data. It discusses the new kinds of large, diverse data being generated and the need for platforms like Hadoop to process and analyze this data. It describes the core components of Hadoop, including HDFS for distributed storage and MapReduce for distributed processing. It also discusses some of the common applications of Hadoop and other projects in the Hadoop ecosystem like Hive, Pig, and HBase that build on the core Hadoop framework.
Have you ever heard the buzzword "big data"? Briefly described, big data is about collecting massive amounts of data, extracting both the small details and the larger trends available in it, summarizing the output, and generating important insight about customers and competitors.
Enterprises seem to have sensed that something is in the air and have started to shop for technology. So what does the world have to offer enterprises that have an unknown number of petabytes flowing through their systems on a daily basis? There are a few options, but very few that can match the popularity of Hadoop. Hadoop can store and process large amounts of data. It has a large and diverse toolset for integrations, operations and processing, and it is open source!
1) Big data is growing exponentially and new frameworks like Hadoop are needed to analyze large, unstructured datasets.
2) Hadoop uses distributed computing and storage across commodity servers to provide scalable and cost-effective analytics. It leverages local disks on each node for temporary data to improve performance.
3) Virtualizing Hadoop simplifies operations, enables mixed workloads, and provides high availability through features like vMotion and HA. It also allows for elastic scaling of compute and storage resources.
This document discusses distributed data processing using MapReduce and Hadoop in a cloud computing environment. It describes the need for scalable, economical, and reliable distributed systems to process petabytes of data across thousands of nodes. It introduces Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers using MapReduce. Key aspects of Hadoop discussed include its core components HDFS for distributed file storage and MapReduce for distributed computation.
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
The document summarizes Quantcast File System (QFS), an alternative to HDFS that provides petabyte storage at half the disk space of HDFS. QFS offers significantly faster I/O than HDFS through the use of Reed-Solomon encoding, requiring only 1.5x disk space compared to HDFS's 3x. It has been production hardened at Quantcast under massive processing loads and is fully compatible with Apache Hadoop. Benchmark results show QFS writes are half the disk I/O of HDFS writes and reads require accessing the network versus HDFS's focus on data locality.
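For context on where the 1.5x figure can come from, the quick check below computes the storage overhead of replication versus Reed-Solomon striping; the 6-data + 3-parity parameters are an assumption commonly cited for QFS, not taken from this document.

```python
# Storage overhead: replication stores N full copies; Reed-Solomon stores
# data stripes plus parity stripes, so overhead = (data + parity) / data.
def rs_overhead(data_stripes: int, parity_stripes: int) -> float:
    return (data_stripes + parity_stripes) / data_stripes

print("3-way replication:", 3.0, "x")
print("RS(6 data + 3 parity):", rs_overhead(6, 3), "x")   # -> 1.5x
```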
This document provides an overview of big data and Hadoop. It discusses what big data is, why it has become important recently, and common use cases. It then describes how Hadoop addresses challenges of processing large datasets by distributing data and computation across clusters. The core Hadoop components of HDFS for storage and MapReduce for processing are explained. Example MapReduce jobs like wordcount are shown. Finally, higher-level tools like Hive and Pig that provide SQL-like interfaces are introduced.
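As a hedged illustration of the wordcount example mentioned above, here is a Hadoop Streaming version in Python (the original materials may use Java; the file name and invocation are assumptions).

```python
# wc.py: run as a Hadoop Streaming job, e.g.
#   hadoop jar hadoop-streaming.jar \
#     -mapper "python wc.py map" -reducer "python wc.py reduce" \
#     -input /data/text -output /data/wordcounts
import sys

def mapper():
    # Emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for each word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```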
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application prospective like accessing HDFS content, writing streaming jobs, using .NET SDK, as well as HDInsight on premise or on Azure.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at:
http://www.gopivotal.com/big-data/pivotal-hd
Similar to Seattle Scalability Meetup - Ted Dunning - MapR (20)
Abstract: From billionaires to moms, some people care deeply about preserving their family photos. This lightning talk, based on Brad Fitzpatrick's 20 percent time project at Google, covers the project's aim, capabilities, and technologies (GoLang, Android), and how they marry up with NYC's focus on blockchain technologies.
This document provides an overview of Riak TS, Basho's new purpose-built time series database. It describes Riak TS's key features like high write throughput, efficient range query support, and horizontal scalability. It also outlines Riak TS's data modeling approach of co-locating and partitioning time-series data, its SQL-like query language, and provides examples of its performance and roadmap. Finally, it demonstrates a potential use case application called UNCORKD for tracking wine check-ins and reviews.
The document discusses connecting local food producers to consumers through a peer-to-peer traceability system. It notes the demand for local, traceable produce from consumers and the liability issues producers face without traceability. The proposed solution is an open-source app that allows producers and consumers to search for and transact locally sourced food through a hub-and-spoke model, building traceability through ordered transactions.
This document proposes a personal databank application to help users manage their private memories and digital assets throughout different life stages. It notes that people currently lack control over their personal data, which is often collected and used by large tech companies. The proposed solution is a personal cloud application that feels like a trusted friend or attorney, allows users to organize their memories and put their affairs in order, and promises never to sell personal data or compromise privacy. It presents an MVP, roadmap, and business model focused on freemium pricing to attract users before offering premium services.
Seattle Scalability meetup intro slides, Jan 22, 2014clive boulton
The document summarizes an upcoming Seattle Scalability and Distributed Systems Meetup on January 22, 2014. The meetup will feature main sessions on Koverse, a data unification platform, and Samza, a new distributed stream processing framework developed at LinkedIn. It will also include community announcements and the potential for an after-beer social. Attendees are encouraged to use the hashtag #SeaScale.
The meetup agenda included two main sessions on scaling SQL databases from scratch using Hadoop and scaling DevOps for websites operating at large scale. There would also be community announcements, an after-beer social at a local pub, and the event was using the hashtag #SeaScale on social media. The meetup was hosted in Seattle on December 4, 2013.
Seattle scalability meetup intro slides 23 oct 2013clive boulton
The document announces a Seattle Scalability and Distributed Systems Meetup on October 23, 2013 that will feature two main sessions - one on Cloudera Search which uses SolrCloud and Hadoop to allow non-technical users to explore and analyze Hadoop data, and another on experiences with Amazon RDS (MySQL) including benefits and challenges. It also lists the after-beer location and provides information on submitting talks and a future GraphLab Notebook meetup.
Seattle scalability meetup intro slides 24 july 2013clive boulton
The document summarizes an upcoming Seattle Scalability and Distributed Systems Meetup on July 24, 2013. It will include main sessions from Simply Measured and SpaceCurve about migrating to HBase and building a new big data platform, respectively. After the sessions there will be community announcements and an after-beer at the Frontier Room. Suggestions are also sought for future technical talks about production-grade big systems.
Seattle Scalability Meetup intro pptx - June 26clive boulton
This document provides an agenda for the Seattle Scalability and Distributed Systems Meetup on June 26, 2013. The main sessions will discuss best practices for developing scalable applications on Google Cloud Platform and how to build many-core scalable software systems using Concurix. There will also be community announcements, an after-beer social at Rock Bottom, and the hashtag #SeaScale for the event.
Seattle scalability meetup intro ppt May 22clive boulton
The document summarizes an upcoming Seattle Scalability and Distributed Systems Meetup on May 22, 2013. The meetup will include main sessions from Atigeo on their big data platform xPatterns and from GraphLab on their highly scalable machine learning algorithms. There will also be community announcements and an after-beer event at a nearby restaurant. Suggestions for future technical talks on production-grade big systems are welcomed.
This document discusses challenges around patents, intellectual property, and patent trolls as it relates to VRM (Vendor Relationship Management). It notes that large tech companies obtain many patents through their legal teams, and that patents can bolster startup valuations but also allow failed companies to assert patent claims. The document outlines technical, legal, social, usability, business model, and other challenges around VRM and sharing user data and controls. It suggests building on cloud platforms or a personal cloud and obtaining patents for defense. It also discusses challenges around large company vs. startup legal battles and dealing with patent trolls.
Seattle scalability meetup March 27,2013 intro slidesclive boulton
The document summarizes an upcoming meetup about scalability and distributed systems in Seattle. The meetup will include main sessions on Hortonworks and HBase application development by Nick Dimiduk and Saffron's brain-like analytics by Paul Hofmann. There will also be community announcements, an after-beer at a nearby restaurant, and the hashtag #seascale for the event.
The document summarizes an upcoming meetup on scalability and distributed systems in Seattle. It includes two main sessions, one by Braintree on achieving high availability in their Ruby on Rails application, and one by Cloudant on scaling geospatial queries across a distributed database. It also lists the location for an after-meetup social at a local bar and contact information for proposing future talk ideas.
Seattle Scalability Meetup | Accumulo and WhitePagesclive boulton
The meetup agenda included community announcements, two main sessions about Accumulo and WhitePages, and an after-beer social at Rock Bottom Brewery. Paul Brown from Koverse would discuss Accumulo, an Apache project, covering its design, unique features, and comparisons. Scott Sikora from WhitePages would discuss how they developed a daily SOLR index of every business in the US using Hadoop, Pig, and a custom job control system. The meetup encouraged technical talks about large, production-grade systems.
This document summarizes an upcoming Seattle Scalability Meetup event. It will feature talks on Cloudant (CouchDB), SEOMoz's Linkscape (now Mozscape) tool, and Microsoft's investments in Apache Hadoop, SQL Server, and big data jobs. The event will be held at a Microsoft venue, with pizza sponsored by Hortonworks and drinks by eSage. It encourages submissions of in-depth technical talks on production-grade big systems. The post-talk social is at a nearby bar.
The document announces a Seattle monthly meetup about Hadoop, Scalability, and NoSQL topics. The meetup will include community announcements, a main session by Matt Schumpert from Datameer on unlocking the power of Hadoop using their product, and socializing over beer afterwards. Gary and Greg from Clipboard will also give a talk on why they use Riak to support search of data in time-based order for high performance without mapreduce.
The document discusses leveraging a polyglot Platform-as-a-Service approach to address legacy systems and enable a transition to cloud technologies. It proposes exposing legacy data through APIs for mobile and social applications while retaining the existing on-premise systems. The approach would use distributed databases like CouchDB hosted on Cloudant to provide scalability and offline replication capabilities. This would allow developing new applications independently of legacy systems while providing a path for incremental transformation over time.
Whole Chain Traceability, pulling a Kobayashi Maru. clive boulton
The Whole Chain Traceability Consortium needs to improve one-up / one-down information sharing. Pulling a Kobayashi Maru: connecting enterprises to consumers using polyglot technologies for the design and development of a Whole Chain Traceability application.
This document discusses improving one-up/one-down information sharing between companies in the supply chain. It proposes focusing on the user experience through a neutral data store with an open API and reference applications. This would allow connecting enterprises to consumers using polyglot technologies and elastic cloud infrastructure. The goal is to build trust and adoption by making it easy for others to use and develop on top of the system.
2. Agenda
• Lightning talks / community announcements
• Main Speaker
• Bier @ Feierabend - 422 Yale Ave North
• Hashtags #Seattle #Hadoop
3. Fast & Frugal: Running a Lean Startup
with AWS – Oct 27th 10am-2pm
http://aws.amazon.com/about-aws/events/
4. Seattle AWS User Group November
9th, 2011 – 6:30 -9pm
• November we're going to hear from Amy
Woodward from EngineYard about keeping
your systems live through outages and other
problems using EngineYard atop AWS. Come
check out this great talk and learn a thing or
three about EngineYard& keeping high
availability for your systems!
• http://www.nwcloud.org
5. www.mapr.com
• MapR is an amazing new distributed filesystem modeled after Hadoop. It maintains API compatibility with Hadoop, but far exceeds it in performance, manageability, and more.
9. For startups
• History is always small
• The future is huge
• Must adopt new technology to survive
• Compatibility is not as important
– In fact, incompatibility is assumed
10. Physics of large companies
[Chart: company size over time; after the startup phase, absolute growth is still very large]
11. For large businesses
• Present state is always large
• Relative growth is much smaller
• Absolute growth rate can be very large
• Must adopt new technology to survive
– Cautiously!
– But must integrate technology with legacy
• Compatibility is crucial
12. The startup technology picture
[Diagram: old computers and software, current computers and software, and expected hardware and software growth, with no compatibility requirement between them]
13. The large enterprise picture
[Diagram: current hardware and software, a proof-of-concept Hadoop cluster, and a long-term Hadoop cluster, all of which must work together]
14. What does this mean?
• Hadoop is very, very good at streaming through things in batch jobs
• HBase is good at persisting data in very write-heavy workloads
• Unfortunately, the foundation of both systems is HDFS, which does not export or import well
15. Narrow Foundations
Big data is heavy and expensive to move.
[Diagram: web services, Pig, Hive, OLAP, OLTP, sequential file processing, Map/Reduce, and HBase each sit on their own silo: RDBMS, NAS, or HDFS]
16. Narrow Foundations
• Because big data has inertia, it is difficult to move
– It costs time to move
– It costs reliability because of more moving parts
• The result is many duplicate copies
17. One Possible Answer
• Widen the foundation
• Use standard communication protocols
• Allow conventional processing to share with parallel processing
18. Broad Foundation
[Diagram: the same stack of web services, Pig, Hive, OLAP, OLTP, sequential file processing, Map/Reduce, and HBase over RDBMS, NAS, and HDFS, now sharing MapR as a common foundation]
19. Broad Foundation
• Having a broad foundation allows many kinds of computation to work together
• It is no longer necessary to throw data over a wall
• Performance much higher for map-reduce
• Enterprise grade feature sets such as snapshots and mirrors can be integrated
• Operations more familiar to admin staff
21. Map-reduce key details
• User supplies f1 (map) and f2 (reduce)
– Both are pure functions with no side effects
• Framework supplies input, shuffle, output
• Framework will re-run f1 and f2 on failure
• Redundant task completion is OK
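To make that division of labor concrete, here is a minimal Hadoop job skeleton in Java: the user supplies only f1 and f2 (the F1/F2 class names and their placeholder logic are ours, purely for illustration), while the framework supplies input splitting, the shuffle, output, and re-execution on failure.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SkeletonJob {

  // f1 (map): a pure function of its input record; no side effects outside ctx.write()
  public static class F1 extends Mapper<LongWritable, Text, Text, IntWritable> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(line.toString()), new IntWritable(1));   // placeholder logic
    }
  }

  // f2 (reduce): a pure function of a key and all values emitted for that key
  public static class F2 extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws java.io.IOException, InterruptedException {
      int n = 0;
      for (IntWritable v : values) n += v.get();
      ctx.write(key, new IntWritable(n));                         // placeholder logic
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "skeleton");
    job.setJarByClass(SkeletonJob.class);
    // The user names f1 and f2; everything else (splits, shuffle, output,
    // re-running failed tasks, tolerating redundant completions) is the framework's job.
    job.setMapperClass(F1.class);
    job.setReducerClass(F2.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because redundant task completion is OK, the framework can also run speculative copies of slow tasks and keep whichever finishes first; that only works because f1 and f2 have no side effects.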
23. Map-Reduce
[Diagram: input → f1 (map) tasks → local disk → shuffle → f2 (reduce) tasks → output]
24. Example – WordCount
• Mapper
– read line, tokenize into words
– emit (word, 1)
• Reducer
– read (word, [k1, … , kn])
– Emit (word, Σki)
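A minimal Java sketch of that WordCount pair, using the standard Hadoop mapreduce API (the class names are ours; the classes would be wired into a driver like the skeleton shown after slide 21):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: read a line, tokenize into words, emit (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(line.toString());
    while (tok.hasMoreTokens()) {
      word.set(tok.nextToken());
      ctx.write(word, ONE);
    }
  }
}

// Reducer: read (word, [k1, ..., kn]), emit (word, sum of ki)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    ctx.write(word, new IntWritable(sum));
  }
}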
25. Example – Map Tiles
• Input is set of objects
– Roads (polyline)
– Towns (polygon)
– Lakes (polygon)
• Output is set of map-tiles
– Graphic image of part of map
26. Bottlenecks and Issues
• Read-only files
• Many copies in I/O path
• Shuffle based on HTTP
– Can’t use new technologies
– Eats file descriptors
• Spills go to local file space
– Bad for skewed distribution of sizes
27. MapR Areas of Development
[Diagram: HBase, Map/Reduce, and the surrounding ecosystem layered over storage, management, and services]
28. MapR Improvements
• Faster file system
– Fewer copies
– Multiple NICS
– No file descriptor or page-buf competition
• Faster map-reduce
– Uses distributed file system
– Direct RPC to receiver
– Very wide merges
29. MapR Innovations
• Volumes
– Distributed management
– Data placement
• Read/write random access file system
– Allows distributed meta-data
– Improved scaling
– Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors
30. MapR's Containers
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks.
Each container contains directories & files and data blocks, and is replicated on servers.
Containers are 16-32 GB segments of disk, placed on nodes; there is no need to manage them directly.
31. MapR's Containers
Each container has a replication chain. Updates are transactional. Failures are handled by rearranging replication.
32. Container locations and replication
[Diagram: nodes N1, N2, and N3, each hosting containers with replica lists such as (N1, N2), (N3, N2), and (N1, N3)]
The container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order.
33. MapR Scaling
• Containers represent 16-32 GB of data; each can hold up to 1 billion files and directories
• 100M containers = ~2 exabytes (a very large cluster)
• 250 bytes of DRAM to cache a container; 25 GB to cache all containers for a 2 EB cluster
– But not necessary, can page to disk; a typical large 10 PB cluster needs 2 GB
• Container reports are 100x - 1000x smaller than HDFS block reports; serve 100x more data nodes
• Increase container size to 64 GB to serve a 4 EB cluster; map/reduce is not affected
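A quick back-of-the-envelope check of those numbers (a sketch only; the ~20 GB average container size is our assumption within the stated 16-32 GB range, and the other constants come from the slide):

public class ContainerScaling {
  public static void main(String[] args) {
    double containerGB = 20;                  // assume ~20 GB average within the 16-32 GB range
    double clusterEB = 2;                     // "a very large cluster"
    double clusterGB = clusterEB * 1e9;       // 1 EB = 10^9 GB

    double containers = clusterGB / containerGB;             // ~1e8, i.e. ~100M containers
    double dramBytesPerContainer = 250;                      // DRAM needed to cache one container
    double dramGB = containers * dramBytesPerContainer / 1e9;

    System.out.printf("containers: %.0f million%n", containers / 1e6);   // ~100 million
    System.out.printf("DRAM to cache them all: %.0f GB%n", dramGB);      // ~25 GB
  }
}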
34. MapR's Streaming Performance
[Bar charts: streaming throughput in MB per second for read and write, comparing raw hardware, MapR, and Hadoop on 11 x 7200 rpm SATA and 11 x 15K rpm SAS disks; higher is better. Tests: i. 16 streams x 120 GB, ii. 2000 streams x 1 GB.]
35. Terasort on MapR
10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm
[Bar charts: elapsed time in minutes for 1.0 TB and 3.5 TB Terasort runs, MapR vs. Hadoop; lower is better.]
36. HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 core, 24 GB DRAM, 11 x 1 TB 7200 RPM
[Bar chart: records per second for Zipfian and uniform key distributions, MapR vs. Apache; higher is better.]
37. Small Files (Apache Hadoop, 10 nodes)
[Chart: file-create rate (files/sec) vs. number of files in millions, out of the box and tuned. Op: create file, write 100 bytes, close. Notes: NN not replicated; NN uses 20G DRAM; DN uses 2G DRAM.]
38. MUCH faster for some operations
Same 10 nodes …
[Chart: file-create rate vs. number of files (millions).]
39. What MapR is not
• Volumes != federation
– MapR supports > 10,000 volumes, all with independent placement and defaults
– Volumes support snapshots and mirroring
• NFS != FUSE
– Checksum and compress at gateway
– IP fail-over
– Read/write/update semantics at full speed
• MapR != maprfs
40. Not Your Father's NFS
• Multiple architectures possible
• Export to the world
– NFS gateway runs on selected gateway hosts
• Local server
– NFS gateway runs on local host
– Enables local compression and checksumming
• Export to self
– NFS gateway runs on all data nodes, mounted from localhost
41. Export to the world
[Diagram: several NFS servers inside the cluster exporting the filesystem to an external NFS client]
42. Local server
[Diagram: the application and NFS client run on the same host as a local NFS server, which talks to the cluster nodes]
46. Sharded text indexing
[Diagram: input documents → Map (assign documents to shards) → Reducer (index text to local disk, then copy the index to the distributed file store) → clustered index storage; a copy back to the search engine's local disk is typically required before the index can be loaded]
47. Sharded text indexing
• Mapper assigns document to shard
– Shard is usually hash of document id
• Reducer indexes all documents for a shard
– Indexes created on local disk
– On success, copy index to DFS
– On failure, delete local files
• Must avoid directory collisions
– can’t use shard id!
• Must manage and reclaim local disk space
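A rough Java sketch of the shard-assignment mapper described above; the shard count, key type, and tab-separated document format are illustrative assumptions, not details from the talk.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assign each document to a shard by hashing its id; one reduce group per shard then builds that shard's index.
public class ShardAssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private static final int NUM_SHARDS = 16;   // illustrative value

  @Override
  protected void map(LongWritable offset, Text doc, Context ctx)
      throws IOException, InterruptedException {
    // Assume the document id is the first tab-separated field of the record.
    String id = doc.toString().split("\t", 2)[0];
    int shard = (id.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS;
    ctx.write(new IntWritable(shard), doc);
  }
}

On the reduce side, the index is built on local disk and copied into the distributed file store only on success; to avoid the directory collisions mentioned above, the target directory has to be unique per task attempt (for example, derived from the attempt id) rather than named after the shard id.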
48. Conventional data flow
[Diagram: input documents → Map → Reducer → local disk → clustered index storage → local disk → search engine. Failure of a reducer causes garbage to accumulate in the local disk; failure of the search engine requires another download of the index from clustered storage.]
49. Simplified NFS data flows
[Diagram: input documents → Map → Reducer → clustered index storage, which the search engine reads directly. Failure of a reducer is cleaned up by the map-reduce framework; the search engine reads the mirrored index directly.]
50. Simplified NFS data flows
[Diagram: input documents → Map → Reducer → clustered storage mirrored out to several search engines. Mirroring allows exact placement of index data; arbitrary levels of replication are also possible.]
52. K-means
• Classic E-M based algorithm
• Given cluster centroids,
– Assign each data point to nearest centroid
– Accumulate new centroids
– Rinse, lather, repeat
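A minimal single-machine sketch of one such E-M iteration in Java, just to show the assign/accumulate structure (squared Euclidean distance and dense double[] points are our assumptions):

public class KMeansStep {
  // One iteration: assign each point to its nearest centroid, then accumulate new centroids.
  static double[][] step(double[][] points, double[][] centroids) {
    int k = centroids.length, d = centroids[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];

    for (double[] p : points) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {              // find the nearest centroid
        double dist = 0;
        for (int j = 0; j < d; j++) {
          double diff = p[j] - centroids[c][j];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      counts[best]++;                             // accumulate the new centroid
      for (int j = 0; j < d; j++) sums[best][j] += p[j];
    }

    double[][] next = new double[k][d];
    for (int c = 0; c < k; c++) {
      for (int j = 0; j < d; j++) {
        next[c][j] = counts[c] == 0 ? centroids[c][j] : sums[c][j] / counts[c];
      }
    }
    return next;   // rinse, lather, repeat until the centroids stop moving
  }
}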
53. K-means, the movie
[Diagram: input → assign to nearest centroid → aggregate new centroids → updated centroids feed back into the assignment step]
57. Old tricks, new dogs
• Mapper (centroids read from local disk via the distributed cache, which copies them from HDFS to local disk)
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS (written by map-reduce)
58. Old tricks, new dogs
• Mapper (centroids read from NFS)
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS / MapR FS (written by map-reduce)
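The only real change from the previous slide is where the centroids come from: with the cluster filesystem mounted over NFS, side-data can be read with ordinary file I/O instead of being staged through the distributed cache. A hedged sketch; the mount point follows the /mapr/my.cluster/... convention used later in the talk, and the whitespace-separated file format is an assumption.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CentroidLoader {
  // Read centroids with plain java.nio from the NFS-mounted cluster filesystem,
  // e.g. in a mapper's setup() method, instead of using the distributed cache.
  static List<double[]> load(String path) throws IOException {
    List<double[]> centroids = new ArrayList<>();
    for (String line : Files.readAllLines(Paths.get(path))) {
      if (line.trim().isEmpty()) continue;
      String[] fields = line.trim().split("\\s+");
      double[] c = new double[fields.length];
      for (int i = 0; i < fields.length; i++) c[i] = Double.parseDouble(fields[i]);
      centroids.add(c);
    }
    return centroids;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical path on an NFS-mounted MapR volume.
    System.out.println(load("/mapr/my.cluster/home/ted/centroids.txt").size() + " centroids");
  }
}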
59. Poor man's Pregel
• Mapper
    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
• Lines in bold can use conventional I/O via NFS
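A rough Java rendering of that loop, under heavy assumptions: the model lives as a plain file on the NFS-mounted cluster filesystem, the accumulation step is a trivial stand-in, and the synchronize step (which a real job would implement with some barrier among the mappers) is only a placeholder.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class PregelStyleLoop {
  public static void main(String[] args) throws IOException, InterruptedException {
    Path model = Paths.get("/mapr/my.cluster/home/ted/model.txt");  // hypothetical path
    Path input = Paths.get("/mapr/my.cluster/home/ted/input.txt");  // hypothetical path

    for (int iter = 0; iter < 10; iter++) {          // "while not done", bounded here
      // read and accumulate input models (the previous iteration's output, if any)
      int previousLines = Files.exists(model) ? Files.readAllLines(model).size() : 0;

      // for each input: accumulate model (stand-in: just count input records)
      int records = Files.readAllLines(input).size();

      // write model back to the shared, NFS-visible location
      Files.write(model, List.of("iteration=" + iter,
                                 "records=" + records,
                                 "previousModelLines=" + previousLines));

      // synchronize: a real job would wait for its peers here (barrier file, counter, ...)
      Thread.sleep(100);
      // reset input format and loop; the summary is emitted once after the loop
    }
  }
}

The point of the slide is that the I/O steps marked in bold on the original become ordinary reads and writes against an NFS mount, so no custom messaging layer is needed.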
60. Click modeling architecture
[Diagram: input → feature extraction, join, and down-sampling (map-reduce) → data → sequential SGD learning; side-data is now supplied via NFS]
61. Click modeling architecture
[Diagram: the same pipeline, but map-reduce cooperates with NFS to feed several sequential SGD learners in parallel with the extracted data and side-data]
67. Trivial visualization interface
• Map-reduce output is visible via NFS
$ R
> x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
> plot(error ~ t, x)
> q(save="n")
• Legacy visualization just works
68. Conclusions
• We used to know all this
• Tab completion used to work
• 5 years of work-arounds have clouded our memories
• We just have to remember the future
Editor's Notes
Constant time implies a constant factor of growth. Thus the accumulation of all history before 10 time units ago is less than half the accumulation in the last 10 units alone. This is true at all times.
Startups use this fact to their advantage and completely change everything to allow time-efficient development initially with conversion to computer-efficient systems later.
Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
The startup can throw away history because it is so small. That means that the startup has almost no compatibility requirement because the data lost due to lack of compatibility is a small fraction of the total data.
A large enterprise cannot do that. It has to have access to the old data and has to share between old data and Hadoop-accessible data. This doesn't have to happen at the proof-of-concept level, but it really must happen when Hadoop first goes to production.
But stock Hadoop does not handle this well.
This is because Hadoop and other data silos have different foundations. What is worse, there is a semantic wall that separates HDFS from normal resources.
Here is a picture that shows how MapR can replace the foundation and provide compatibility. Of course, MapR provides much more than just the base, but the foundation is what imposes the fundamental limitation, or the lack of one in MapR's case.