Breaking Spark: Top 5 mistakes to avoid when using Apache Spark in production


This document discusses the top 5 mistakes to avoid when using Apache Spark in production environments. It identifies the most common issues as: 1) scaling of the architecture to larger datasets and cluster sizes, 2) insufficient memory configurations for drivers and executors, 3) errors in end user code, 4) incompatible dependencies causing classpath issues, and 5) administration and operation-related problems around resource management, security, and configurations. Remedies for each type of issue are provided.


© Cloudera, Inc. All rights reserved.
Breaking Spark
Top 5 Mistakes to Avoid When Using Apache Spark in Production
Neelesh Srinivas Salian

Outline
• Introduction to Spark
• Information gathering
• Classification of the Issues
• Description
• Remedies
• Q&A

Information Gathering
Source:
• Cloudera Customer Opened Cases
• Posts in Cloudera Community Forums
• Spark use cases for the following:
   • Spark Core: Regular Spark applications and memory configuration problems
   • Spark Streaming: Problems in the ingest pipelines; problems in interacting with Kafka
   • Spark on YARN: Usually resource management issues

Classification
1. Scaling of the Architecture
2. Memory Configurations
3. End user Code
4. Incompatible Dependencies
5. Administration/Operation related issues

1. Scaling of the Architecture
   1. Increase in the cluster size
   2. Utilization of the cluster

1. Scaling of the Architecture: Larger Datasets
   1. Understand the initial setup and the current one
   2. Look at the logs / console to find the error
   3. Determine the parameters
   4. Tune away

2. Memory Configurations
Remember:
YARN
• Container Memory: yarn.nodemanager.resource.memory-mb
• Container Memory Maximum: yarn.scheduler.maximum-allocation-mb
• Container Memory Minimum: yarn.scheduler.minimum-allocation-mb
Standalone
• Java Heap Size of Worker in Bytes: worker_max_heapsize
• Java Heap Size of Master in Bytes: master_max_heapsize
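
The YARN settings above bound what a single executor can request. The following Python sketch shows the rough arithmetic. The overhead rule (10% of the heap with a 384 MB floor) and the rounding up to a multiple of the minimum allocation are assumptions that differ between Spark versions and YARN schedulers, and the function and parameter names are hypothetical.

```python
import math


def container_request_mb(executor_memory_mb, min_alloc_mb, max_alloc_mb,
                         overhead_fraction=0.10, overhead_floor_mb=384):
    """Estimate the YARN container size for one Spark executor.

    Assumptions (not from the slides): off-heap overhead is
    max(floor, fraction * heap), and the scheduler rounds each request
    up to a multiple of yarn.scheduler.minimum-allocation-mb.
    """
    overhead_mb = max(overhead_floor_mb,
                      int(executor_memory_mb * overhead_fraction))
    requested_mb = executor_memory_mb + overhead_mb
    if requested_mb > max_alloc_mb:
        raise ValueError(
            f"{requested_mb} MB exceeds "
            f"yarn.scheduler.maximum-allocation-mb ({max_alloc_mb} MB)"
        )
    # Round up to the next multiple of the minimum allocation,
    # but never beyond the maximum allocation.
    rounded_mb = math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb
    return min(rounded_mb, max_alloc_mb)
```

Under these assumptions, a 4096 MB executor with a 1024 MB minimum and an 8192 MB maximum allocation occupies a 5120 MB container, which is why the number of executors that fit under yarn.nodemanager.resource.memory-mb is often lower than the heap size alone suggests.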

2. Memory Configurations: Insufficient Memory
Symptoms:
Executor error:
ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
Driver error:
ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-60] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
Cause:
The driver or executor is configured with a memory limit that is too low for the application being run.
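
To tell the two symptoms apart in a long log, a small scan can help. This is a minimal Python sketch: the heuristic that the most recent ERROR line names the failing thread ("sparkDriver" for the driver, "Executor task launch worker" for an executor) is an assumption taken from the two sample messages above, and the function name is hypothetical.

```python
OOM_MARKER = "java.lang.OutOfMemoryError"


def classify_oom(log_lines):
    """Return the set of components ('driver', 'executor', 'unknown')
    that reported an OutOfMemoryError in the given log lines."""
    found = set()
    last_error = ""
    for line in log_lines:
        # Remember the latest ERROR line; it carries the thread name.
        if "ERROR" in line:
            last_error = line
        if OOM_MARKER in line:
            if "sparkDriver" in last_error:
                found.add("driver")
            elif "Executor task launch worker" in last_error:
                found.add("executor")
            else:
                found.add("unknown")
    return found
```

The result indicates which of spark.driver.memory or spark.executor.memory to look at first; it does not say how much memory is needed.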

2. Memory Configurations: Insufficient Memory
1. Understand the Spark configuration order of precedence (highest first)
   1. Parameters set on SparkConf in the application code.
   2. Command-line arguments passed to spark-submit, spark-shell, or pyspark.
   3. Properties in the spark-defaults.conf configuration file, as set through the Cloudera Manager UI.
2. Verify the driver and executor memory used
   1. Open the Environment tab in the History Server
   2. Check the driver and executor memory usage values
3. Increase the heap
   • spark.driver.memory
   • spark.executor.memory
   Caveat: do not set spark.driver.memory through SparkConf; use --driver-memory instead. In client mode the driver JVM has already started by the time SparkConf is read.
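
The order of precedence can be modelled as three layers merged from lowest to highest priority. This Python sketch only illustrates the rule; it is not a Spark API, the function name and the sample values are hypothetical, and it does not model the spark.driver.memory caveat.

```python
def effective_conf(defaults_file, command_line, spark_conf):
    """Merge three configuration layers given as dicts.

    Layers are applied from lowest to highest precedence, so a later
    layer overrides an earlier one. Each result value is a
    (value, source) pair to show where the setting came from.
    """
    layers = (
        ("spark-defaults.conf", defaults_file),
        ("command line", command_line),
        ("SparkConf", spark_conf),
    )
    merged = {}
    for source, layer in layers:
        for key, value in layer.items():
            merged[key] = (value, source)
    return merged


if __name__ == "__main__":
    conf = effective_conf(
        {"spark.executor.memory": "2g", "spark.driver.memory": "1g"},
        {"spark.executor.memory": "4g"},
        {},
    )
    for key, (value, source) in sorted(conf.items()):
        print(f"{key} = {value} (from {source})")
```

Comparing such a merged view with the Environment tab is a quick way to spot a value that is being overridden by a higher layer.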

3. End User Code
   1. Why choose Spark code over MapReduce?
   2. Expensive operations
   3. Syntax / command errors

4. Incompatible Dependencies
   1. Incorrectly built jars
   2. Version mismatch

4. Incompatible Dependencies: Classpath Issues
Symptoms
• Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
• java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.MRJobConfig
• java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
• ImportError: No module named qt_returns_summary_wrapper

4. Incompatible Dependencies: Classpath Issues
What to Do?
   1. Verify whether the application uses classes that are not part of the framework (Hadoop/Spark)
   2. Verify that the application is built using the correct artifacts
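
One way to check an artifact is to look inside it: a jar is a zip archive, so its class entries can be listed with the Python standard library. This is a minimal sketch with a hypothetical function name; it inspects a single jar only and says nothing about the rest of the cluster classpath or about version conflicts between jars.

```python
import zipfile


def missing_classes(jar_path, class_names):
    """Return the fully qualified class names that have no matching
    .class entry inside the jar at jar_path."""
    with zipfile.ZipFile(jar_path) as jar:
        entries = set(jar.namelist())
    return [
        name for name in class_names
        if name.replace(".", "/") + ".class" not in entries
    ]
```

For example, calling it on an application jar with "org.apache.hadoop.fs.FSDataInputStream" shows whether that class was bundled into the jar or is expected to come from the cluster's Hadoop installation.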

5. Administration/Operation Related Issues
   1. Management of the cluster
   2. User-related issues
   3. Security/permissions
   4. No configurations set up

Summary
• Learn from the Mistakes
• Be prepared
• References:
   • Tuning Spark Blog Cloudera
   • Tuning YARN Cloudera
   • Spark Documentation

Credits
• Wilfred
• Paula
• Bjorn
• Mark
• Anand
• Sean
• Attila

Thank you
Neelesh Srinivas Salian
LinkedIn
Email: neeleshssalian@gmail.com
Twitter: NeelS7


Kafka for DBAs

Gwen (Chen) Shapira

This document discusses Apache Kafka and how it can be used by Oracle DBAs. It begins by explaining how Kafka builds upon the concept of a database redo log by providing a distributed commit log service. It then discusses how Kafka is a publish-subscribe messaging system and can be used to log transactions from any database, application logs, metrics and other system events. Finally, it discusses how schemas are important for Kafka since it only stores messages as bytes, and how Avro can be used to define and evolve schemas for Kafka messages.

Productionizing Spark and the Spark Job Server

Evan Chan

You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.

Updated Power of the AWR Warehouse, Dallas, HQ, etc.

Kellyn Pot'Vin-Gorman

Decoupling Decisions with Apache Kafka

Grant Henke

Whether you are developing a greenfield data project or migrating a legacy system, there are many critical design decisions to be made. Often, it is advantageous to not only consider immediate requirements, but also the future requirements and technologies you may want to support. Your project may start out supporting batch analytics with the vision of adding realtime support. Or your data pipeline may feed data to one technology today, but tomorrow an entirely new system needs to be integrated. Apache Kafka can help decouple these decisions and provide a flexible core to your data architecture. This talk will show how building Kafka into your pipeline can provide the flexibility to experiment, evolve and grow. It will also cover a brief overview of Kafka, its architecture, and terminology.

Kscope Not Your Father's Enterprise Manager

Kellyn Pot'Vin-Gorman

Get most out of Spark on YARN

DataWorks Summit

This document provides information about running Spark on YARN including: - Spark allows processing of large datasets in a distributed manner using Resilient Distributed Datasets (RDDs). - When running on YARN, Spark is able to leverage existing Hadoop clusters for locality-aware processing, resource management, and other benefits while still using its own execution engine. - Running Spark on YARN provides advantages like shipping code to where the data is located instead of moving large amounts of data, leveraging existing Hadoop cluster infrastructure, and allowing Spark workloads to run natively within Hadoop.

Power of the AWR Warehouse- HotSos Symposium 2015

Kellyn Pot'Vin-Gorman

This document discusses the Power of the AWR Warehouse and beyond in Oracle Enterprise Manager 12c Release 4. It provides an overview of the architecture and features of the AWR Warehouse, including how it allows centralized long-term storage and analysis of AWR data across multiple databases without runtime overhead. The key benefits are space savings, offloading resource demands for deep analysis to the warehouse, and centralizing data identified by database to analyze multiple databases.

Breaking Spark: Top 5 mistakes to avoid when using Apache Spark in production

Slide 4: Information Gathering
Source:
• Cloudera Customer Opened Cases
• Posts in Cloudera Community Forums
• Spark use cases for the following:
  Spark Core: regular Spark applications and memory configuration problems
  Spark Streaming: problems in the ingest pipelines; problems in interacting with Kafka
  Spark on YARN: usually resource management issues

Slide 5: Classification
1. Scaling of the Architecture
2. Memory Configurations
3. End user Code
4. Incompatible Dependencies
5. Administration/Operation related issues

Slide 7: 1. Scale of the Architecture: Larger Datasets
1. Understand the initial setup and current one
2. Look at logs / console to find the error
3. Determine parameters
4. Tune away

Slide 8: 2. Memory Configurations
Remember:
YARN
• Container Memory: yarn.nodemanager.resource.memory-mb
• Container Memory Maximum: yarn.scheduler.maximum-allocation-mb
• Container Memory Minimum: yarn.scheduler.minimum-allocation-mb
Standalone
• Java Heap Size of Worker in Bytes: worker_max_heapsize
• Java Heap Size of Master in Bytes: master_max_heapsize
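As a rough illustration of how the YARN limits above interact with a Spark executor request, the sketch below estimates the container size that would have to be granted. It assumes the default overhead rule described in the Spark documentation (10% of executor memory, with a 384 MB minimum) and assumes requests are rounded up to a multiple of yarn.scheduler.minimum-allocation-mb. The function names and sample values are hypothetical, and the actual rounding depends on the scheduler in use.

```python
import math


def container_request_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Estimate executor heap plus memory overhead, in MB.

    The 10% factor and 384 MB floor are assumed defaults; check the
    values configured on your own cluster.
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead


def fits_in_yarn(executor_memory_mb, min_allocation_mb, max_allocation_mb, node_memory_mb):
    """Return (granted_mb, fits) for a single executor container.

    min_allocation_mb  ~ yarn.scheduler.minimum-allocation-mb
    max_allocation_mb  ~ yarn.scheduler.maximum-allocation-mb
    node_memory_mb     ~ yarn.nodemanager.resource.memory-mb
    """
    requested = container_request_mb(executor_memory_mb)
    # Assumption: the request is rounded up to a multiple of the minimum allocation.
    granted = math.ceil(requested / min_allocation_mb) * min_allocation_mb
    fits = granted <= max_allocation_mb and granted <= node_memory_mb
    return granted, fits


# Hypothetical cluster: 1 GB minimum, 16 GB maximum, 64 GB per NodeManager.
granted_8g, ok_8g = fits_in_yarn(8192, 1024, 16384, 65536)
granted_16g, ok_16g = fits_in_yarn(16384, 1024, 16384, 65536)
print(granted_8g, ok_8g)
print(granted_16g, ok_16g)
```

In this sketch a 16 GB executor does not fit under a 16 GB container maximum, because the overhead is added on top of the heap before the request reaches YARN.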

Slide 9: 2. Memory Configurations: Insufficient Memory
Symptoms:
Executor error:
ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
Driver error:
ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-60] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
Cause: The driver or executor is configured with a memory limit that is too low for the application being run.
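When scanning logs for these symptoms, a small script can help separate driver failures from executor failures. The sketch below is only a heuristic: it assumes the thread names shown in the two excerpts above ("Executor task launch worker" and "sparkDriver") appear on the line preceding the OutOfMemoryError, which may not hold for other Spark versions. The function name find_oom is hypothetical.

```python
import re

# Matches the OutOfMemoryError line and captures the reason that follows it.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError: (?P<kind>.+)")


def find_oom(log_lines):
    """Return a list of (component, kind) pairs for each OOM found.

    The component is guessed from the most recent thread name seen
    before the OutOfMemoryError line; this is a heuristic, not a rule.
    """
    component = "unknown"
    findings = []
    for line in log_lines:
        if "Executor task launch worker" in line:
            component = "executor"
        elif "sparkDriver" in line:
            component = "driver"
        match = OOM_PATTERN.search(line)
        if match:
            findings.append((component, match.group("kind").strip()))
    return findings


# Sample lines taken from the symptom excerpts above.
sample = [
    "ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread "
    "Thread[Executor task launch worker-0,5,main]",
    "java.lang.OutOfMemoryError: Java heap space",
    "ERROR actor.ActorSystemImpl: Uncaught fatal error from thread "
    "[sparkDriver-akka.actor.default-dispatcher-60] shutting down ActorSystem [sparkDriver]",
    "java.lang.OutOfMemoryError: Java heap space",
]
findings = find_oom(sample)
print(findings)
```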

Slide 10: 2. Memory Configurations: Insufficient Memory
1. Understand Spark Configuration Order of Precedence (highest first)
   1. Parameters passed to SparkConf in the application.
   2. Command-line arguments passed to spark-submit, spark-shell, or pyspark.
   3. Cloudera Manager UI properties set in the spark-defaults.conf configuration file.
2. Verify the Driver and Executor Memory Used
   1. Environment tab in History Server
   2. Check driver and executor memory usage values
3. Increase the Heap
   • spark.driver.memory
   • spark.executor.memory
Caveat: do not set the spark.driver.memory config using SparkConf; use --driver-memory instead.
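The order of precedence can be pictured as three layers of settings merged from lowest to highest priority. The sketch below models that merge with plain dictionaries; it does not call Spark, and the function name effective_conf and the sample values are hypothetical. It also does not model the caveat about spark.driver.memory: in client mode the driver JVM is already running by the time SparkConf is read, which is why the heap size has to come from --driver-memory or the defaults file.

```python
def effective_conf(defaults_file, submit_args, spark_conf):
    """Merge three configuration layers, lowest precedence first.

    defaults_file: properties from spark-defaults.conf
    submit_args:   properties given on the spark-submit command line
    spark_conf:    properties set on SparkConf in application code
    """
    merged = dict(defaults_file)
    merged.update(submit_args)   # command-line arguments override the defaults file
    merged.update(spark_conf)    # SparkConf overrides both
    return merged


# Hypothetical values for illustration only.
defaults_file = {"spark.executor.memory": "2g", "spark.driver.memory": "1g"}
submit_args = {"spark.executor.memory": "4g", "spark.driver.memory": "4g"}
spark_conf = {"spark.executor.memory": "6g"}

resolved = effective_conf(defaults_file, submit_args, spark_conf)
print(resolved)
```

Comparing a merge like this with the Environment tab in the History Server is one way to confirm which layer actually supplied a value.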

Slide 13: 4. Incompatible Dependencies: Classpath Issues
Symptoms:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.MRJobConfig
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
ImportError: No module named qt_returns_summary_wrapper

Slide 14: 4. Incompatible Dependencies: Classpath Issues
What to Do?
1. Verify whether the application is using classes that are not part of the framework (Hadoop/Spark).
2. Verify that the application is built using the correct artifacts.
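One way to approach both checks is to look inside the jars on the classpath and see which of them, if any, contain the class named in the error. The sketch below uses Python's standard zipfile module, since a jar is a zip archive. The helper names are hypothetical, and the demo builds a throwaway jar with an empty placeholder entry so the example runs without a real Hadoop or Spark installation.

```python
import os
import tempfile
import zipfile


def class_entry(class_name):
    """Convert a dotted class name to its path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(class_name, jar_paths):
    """Return the jars from jar_paths that contain the given class."""
    entry = class_entry(class_name)
    hits = []
    for path in jar_paths:
        with zipfile.ZipFile(path) as jar:
            if entry in jar.namelist():
                hits.append(path)
    return hits


# Demo with a temporary jar; the placeholder entry is not a real class file.
with tempfile.TemporaryDirectory() as tmp:
    demo_jar = os.path.join(tmp, "demo-app.jar")
    with zipfile.ZipFile(demo_jar, "w") as jar:
        jar.writestr("org/apache/hadoop/fs/FSDataInputStream.class", b"")
    found = jars_containing("org.apache.hadoop.fs.FSDataInputStream", [demo_jar])
    missing = jars_containing("org.apache.hadoop.mapreduce.MRJobConfig", [demo_jar])
    found_names = [os.path.basename(p) for p in found]

print(found_names, missing)
```

A class that appears in no jar points to a missing dependency, while a class that appears in several jars may indicate conflicting versions on the classpath.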

Editor's Notes

So just to begin with a brief introduction: Spark is an in-memory framework that enables a user to process data and perform analytics using the different components within it. These include:
• Core (the engine of Spark)
• Streaming (which allows you to process data that comes to the user in the form of a stream of bits)
• MLlib (the machine learning library that lets users explore machine learning algorithms to gain better insights)
• SQL (the SQL engine in the Spark framework)
Spark is a data-processing tool that operates on distributed data collections; it doesn't do distributed storage. As part of an environment that includes many of the other components that form the CDH ecosystem, Spark has grown a lot in terms of adoption.
