Berlin buzzwords 2018

•Download as PPTX, PDF•

0 likes•41 views

DataFrame is an awesome interface for data manipulation in Spark but when the complexity grows outside of the capabilities of Spark itself, you need to resort to "violence". In this talk I will explain one of the projects which became too complex to be executed using the DataFrame API and had to be rewritten into a custom code applied using mapPartitions function. We will cover some of the tips and tricks for reducing lineage complexity, share our process of analyzing pain points and get into details of mapPartitions functionality to leverage Spark's distributed processing capabilities and reliability while executing custom code.

Software

When DataFrames fail,
resort to mapPartitions
aka “What to do when Spark can’t handle it”
Berlin Buzzwords, June 2018
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204

Complex problem with an easy solution
(or not)

2k+ lines of resulting dataframe tree schema

Logically correct solution
● Data analysis process (data exploration)
● Get to the correct output ASAP
● Validate result
● Run on subset (if possible)
○ Quick iterations
○ Tends to eliminate data issues

Correct solution
● Correct result
● Repeatable performance
● Scalable and optimized
● Proper data handling

spark.read.jdbc
What in the world?
Specify columnName for proper distribution
numPartitions defines parallelism
Query for bounds (upper and lower)
val df = (spark.read.jdbc(url=jdbcUrl,
table="some_table",
columnName="some_column",
lowerBound=1L,
upperBound=100000L,
numPartitions=100,
connectionProperties=connectionProperties))

(Re)partitioning
df.repartition(100) - repartition to a number of partitions
df.repartition(100, col("some_column")) - repartition and split on column
df.coalesce(10) - coalesce (narrow dependency)

(Re)partitioning
Partition by a join field
Repartition is expensive but can help down the road
coalesce vs repartition
HashPartitioner
Number of partitions divisible by available cores
(usually from 2 to 10 x cores)

Data locality
PROCESS_LOCAL - data is in the same JVM
NODE_LOCAL - data is on the same node (HDFS, local executor)
RACK_LOCAL, ANY - data is not on the executors
NO_PREF - no locality preference

Broadcasting variables
Keeps a copy of data on each worker
Reduces the size of a serialized task as data is on the worker already
Optimizes join operations

Checkpointing
Truncate lineage graph
Reliable - HDFS storage (sc.setCheckpointDir)
Local - node local temp storage

Metrics
Spark WebUI is your friend
Use Spark History Server
JSON metrics over REST API
Collect JMX metrics (Dropwizard)

Let’s use it
Spark is a distributed computing framework
Spark provides resource management
Spark provides controllable work distribution
Spark is highly configurable
Why not?

MapPartitions function
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]

MapPartitions function - indexed
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]

MapPartitions example
def mapFunction(args): Iterator[Row] => Iterator[Row] = iterator => {
// Processing based on the partition data
// Example:
// val ids = iterator.toList.map(row => row.getInt(0))
// Processor.doWork(ids)
iterator
}

Other usages
Applying custom processing code
Running ML pipeline
Using third party libraries
Misc

Results
Processing time dropped from ~96 hours to ~3.5hours (28 times)
Adding new entities doesn’t significantly impact complexity/performance
Additional control which we leveraged down the road
We managed to use all the resources
We managed to use all the resources
Raised code complexity
PROs
CONs

Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
Thank you

As a developer using PostgreSQL one of the most important tasks you have to deal with is modeling the database schema for your application. In order to achieve a solid design, it’s important to understand how the schema is then going to be used as well as the trade-offs it involves. As Fred Brooks said: “Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.” In this talk we're going to see practical normalisation examples and their benefits, and also review some anti-patterns and their typical PostgreSQL solutions, including Denormalization techniques thanks to advanced Data Types.

JSONB Tricks: Operators, Indexes, and When (Not) to Use It | PostgresOpen 201...

Citus Data

When do you use jsonb, and when don’t you? How do you make it fast? What operators are available, and what can they do? How will this change? These are all very good questions, but jsonb support in Postgres moves so fast that it’s hard to keep up. In this talk, you will get details on these topics, complete with practical examples and real-world stories: - When to use jsonb, what it’s good for, and when to not use it - Operators and how to use them effectively - Indexing, operator support for indexes, and the tradeoffs involved - Postgres 12 improvements and new features

Photon Technical Deep Dive: How to Think Vectorized

Databricks

Photon is a new vectorized execution engine powering Databricks written from scratch in C++. In this deep dive, I will introduce you to the basic building blocks of a vectorized engine by walking you through the evaluation of an example query with code snippets. You will learn about expression evaluation, compute kernels, runtime adaptivity, filter evaluation, and vectorized operations against hash tables.

ACADILD:: HADOOP LESSON

Padma shree. T

This document discusses using MapReduce to find the top K records in a distributed dataset based on a specific criteria. It begins by explaining MapReduce and its limitations. It then describes finding the top K records on a single machine by sorting the data and selecting the top K. For MapReduce, each mapper finds the top K records within its split and sends to the reducer. The reducer finds the global top K by sorting all records and selecting the top K overall. An example algorithm and sample data are provided to demonstrate how to implement a MapReduce job to solve this problem.

No more struggles with Apache Spark workloads in production

Chetan Khatri

Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production. Apache Spark Primary data structures (RDD, DataSet, Dataframe) Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark. Parallel read from JDBC: Challenges and best practices. Bulk Load API vs JDBC write An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin Avoid unnecessary shuffle Alternative to spark default sort Why dropDuplicates() doesn’t result consistency, What is alternative Optimize Spark stage generation plan Predicate pushdown with partitioning and bucketing Why not to use Scala Concurrent ‘Future’ explicitly!

Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot

Citus Data

Citus is a sharding extension for postgres that can efficiently distribute a wide range of SQL queries. It uses postgres' planner hook to transparently intercept and plan queries on "distributed" tables. Citus then executes the queries in parallel across many servers, in a way that delegates most of the heavy lifting back to postgres. Within Citus, we distinguish between several types of SQL queries, which each have their own planning logic: Local-only queries Single-node “router” queries Multi-node “real-time” queries Multi-stage queries Each type of query corresponds to a different use case, and Citus implements several planners and executors using different techniques to accommodate the performance requirements and trade-offs for each use case. This talk will discuss the internals of the different types of planners and executors for distributing SQL on top of postgres, and how they can be applied to different use cases.

Dynamic Width File in Spark_2016

Subhasish Guha

Distributed Query Processing

Mythili Kannan

This document discusses distributed database and distributed query processing. It covers topics like distributed database, query processing, distributed query processing methodology including query decomposition, data localization, and global query optimization. Query decomposition involves normalizing, analyzing, eliminating redundancy, and rewriting queries. Data localization applies data distribution to algebraic operations to determine involved fragments. Global query optimization finds the best global schedule to minimize costs and uses techniques like join ordering and semi joins. Local query optimization applies centralized optimization techniques to the best global execution schedule.

This document summarizes how switching from Hadoop to Spark for data science applications improved performance, reliability, and reduced costs at Salesforce. Some key issues addressed were handling large datasets across many S3 prefixes, efficiently computing segment overlap on skewed user data, and performing joins on highly skewed datasets. These changes resulted in applications that were 100x faster, used 10x less data, had fewer failures, and reduced infrastructure costs.

Semi join

Alokeparna Choudhury

This presentation discusses semi-joins and their effectiveness in distributed environments. It begins by defining distributed systems and relational algebra operations like joins. It then explains that a semi-join returns rows from the first table that match rows in the second table, without duplicate rows. Examples are provided to illustrate how semi-joins can reduce communication costs compared to conventional joins. The presentation concludes by stating that semi-joins are an efficient way to join data across multiple tables in a query in distributed systems.

Hive query optimization infinity

Shashwat Shriparv

Well designed tables like partitioning and bucketing can improve query speed and reduce costs. Partitioning involves horizontally slicing data, such as by date or location. Bucketing imposes structure allowing more efficient queries, sampling, and map-side joins. Parallel query execution allows subqueries to run simultaneously to improve performance. The explain command helps analyze queries and identify optimizations.

SQL Joins and Query Optimization

Brian Gallagher

CS 542 -- Query Optimization

J Singh

The document discusses query optimization in database management systems. It covers converting SQL queries to logical and physical query plans, improving logical plans through algebraic transformations, and choosing the optimal physical query plan by considering the order of operations and join trees. The goal is to select the most efficient physical plan by estimating the size of relations and intermediate results.

A Graph-Based Method For Cross-Entity Threat Detection

Jen Aman

This document proposes a graph-based method for cross-entity threat detection. It models entity relationships as a multigraph and detects anomalies by identifying unexpected new connections between entities over time. It introduces two algorithms: a naive detector that identifies edges only in the detection graph, and a 2nd-order detector that identifies edges between entity clusters. An experiment on a real dataset found around 700 1st-order and 200 2nd-order anomalies in under 5 minutes, demonstrating the method's ability to efficiently detect threats across unrelated accounts.

ETL and pivoting in spark

Subhasish Guha

1. The document discusses handling small file problems in Spark ETL pipelines. It recommends keeping partition sizes between 2GB and not too small to avoid overhead problems. 2. It provides examples of transformations like aggregation, normalization, and lookup that are commonly used. 3. Pivoting data in Spark is presented as an efficient solution to transform data compared to traditional ETL tools. The example pivots data to summarize by year and quarter within minutes for billions of records.

Hadoop map reduce concepts

Subhas Kumar Ghosh

The document describes Hadoop MapReduce and its key concepts. It discusses how MapReduce allows for parallel processing of large datasets across clusters of computers using a simple programming model. It provides details on the MapReduce architecture, including the JobTracker master and TaskTracker slaves. It also gives examples of common MapReduce algorithms and patterns like counting, sorting, joins and iterative processing.

Database , 8 Query Optimization

Ali Usman

The document discusses distributed query processing and optimization in distributed database systems. It covers topics like query decomposition, distributed query optimization techniques including cost models, statistics collection and use, and algorithms for query optimization. Specifically, it describes the process of optimizing queries distributed across multiple database fragments or sites including generating the search space of possible query execution plans, using cost functions and statistics to pick the best plan, and examples of deterministic and randomized search strategies used.

Overview of query evaluation

avniS

The document discusses Oracle system catalogs which contain metadata about database objects like tables and indexes. System catalogs allow accessing information through views with prefixes like USER, ALL, and DBA. Examples show how to query system catalog views to get information on tables, columns, indexes and views. Query optimization and evaluation are also covered, explaining how queries are parsed, an execution plan is generated, and the least cost plan is chosen.

Guacamole Fiesta: What do avocados and databases have in common?

ArangoDB Database

First, our CTO, Frank Celler, does a quick overview of the latest feature developments and what is new with ArangoDB. Then, Senior Graph Specialist, Michael Hackstein talks about multi-model database movement, diving deeper into main advantages and technological benefits. He introduces three data-models of ArangoDB (Documents, Graphs and Key-Values) and the reasons behind the technology. We have a look at the ArangoDB Query language (AQL) with hands-on examples. Compare AQL to SQL, see where the differences are and what makes AQL better comprehensible for developers. Finally, we touch the Foxx Microservice framework which allows to easily extend ArangoDB and include it in your microservices landscape.

Unit 3 lecture-2

vishal choudhary

1) Comparable and Comparator interfaces in Java can both be used to sort collections but have some key differences. Comparable provides a single sort order while Comparator allows multiple sort orders. 2) WritableComparable is a subinterface that extends both Writable and Comparable interfaces. Any type used as a key in Hadoop MapReduce must implement this interface. 3) RawComparator is an optimization Hadoop provides to directly compare byte arrays containing serialized objects instead of deserializing for comparison. WritableComparator implements RawComparator for WritableComparable types.

Is there a perfect data-parallel programming language? (Experiments with More...

Julian Hyde

The perfect data parallel language has not yet been invented. SQL queries can achieve great performance and scale, but there are many general purpose algorithms that it cannot express. In Morel, we build on the functional and relational roots of MapReduce in an elegant and strongly-typed general-purpose programming language. But Morel is, in a real sense, a query language; programs are executed on relational frameworks such as Google BigQuery and Spark. In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems. We also introduce Apache Calcite, the popular open source framework for query planning, and describe how Morel's compiler uses Calcite's relational algebra and rewrite rules to generate efficient plans.

Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl

Citus Data

Your PostgreSQL database is one of the most important pieces of your architecture - yet the level of introspection available in Postgres is often hard to work with. Its easy to get very detailed information, but what should you really watch out for, send reports on and alert on? In this talk we'll discuss how query performance statistics can be made accessible to application developers, critical entries one should monitor in the PostgreSQL log files, how to collect EXPLAIN plans at scale, how to watch over autovacuum and VACUUM operations, and how to flag issues based on schema statistics. We'll also talk a bit about monitoring multi-server setups, first going into high availability and read standbys, logical replication, and then reviewing how monitoring looks like for sharded databases like Citus. The talk will primarily describe free/open-source tools and statistics views readily available from within Postgres.

Basic Tutorial of Association Mapping by Avjinder Kaler

Avjinder (Avi) Kaler

This document provides a tutorial for performing association mapping using various software. It outlines the steps to format genotype and phenotype data files, estimate population structure with STRUCTURE, and perform association analysis using TASSEL and GAPIT. The steps include filtering markers, formatting files for each program, running models like GLM, MLM, and compressed MLM, and interpreting resulting Manhattan plots, QQ-plots and tables of associated markers and p-values. References for additional tutorials and literature on association mapping techniques are also provided.

2. R-basics, Vectors, Arrays, Matrices, Factors

krishna singh

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...

CloudxLab

Big Data with Hadoop & Spark Training: http://bit.ly/2kyXPo0 This CloudxLab Writing MapReduce Programs tutorial helps you to understand how to write MapReduce Programs using Java in detail. Below are the topics covered in this tutorial: 1) Why MapReduce? 2) Write a MapReduce Job to Count Unique Words in a Text File 3) Create Mapper and Reducer in Java 4) Create Driver 5) MapReduce Input Splits, Secondary Sorting, and Partitioner 6) Combiner Functions in MapReduce 7) Job Chaining and Pipes in MapReduce

Как приготовить тестовые данные для Big Data проекта. Пример из практики

SQALab

The document describes an approach to sampling large datasets for testing Big Data projects. It discusses clustering similar records into groups to reduce the size of the sample. The records are divided into quartiles based on field values and each unique combination of quartiles forms a cluster. A representative sample is taken by selecting one record from each cluster. A tool is presented that automates this process using T-SQL on SQL Server to partition, cluster and filter the data into a reduced representative sample for testing.

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Spark Summit

Netflix is the world’s largest streaming service, with 80 million members in over 250 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the boxart you see, to the decisions made about which TV shows and movies are created. Given this scale, we utilized Apache Spark to be the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API – for ETL, feature generation, model training, and validation. With pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators – enabling modularity, composability and testability. Thus, Netflix engineers can build our own feature engineering logics as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production. In this talk, we will discuss how Apache Spark is used as a distributed framework we build our own algorithms on top of to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.

Map reduce: beyond word count

Jeff Patti

This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice ( http://www.meetup.com/DataPhilly/events/149515412/ ) Map Reduce: Beyond Word Count by Jeff Patti Have you ever wondered what map reduce can be used for beyond the word count example you see in all the introductory articles about map reduce? Using Python and mrjob, this talk will cover a few simple map reduce algorithms that in part power Monetate's information pipeline Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.

Big Data Analytics with Apache Spark

MarcoYuriFujiiMelo

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...

Databricks

What's hot

Scaling up data science applications

Kexin Xie

Semi join

Alokeparna Choudhury

Hive query optimization infinity

Shashwat Shriparv

SQL Joins and Query Optimization

Brian Gallagher

CS 542 -- Query Optimization

J Singh

A Graph-Based Method For Cross-Entity Threat Detection

Jen Aman

ETL and pivoting in spark

Subhasish Guha

Hadoop map reduce concepts

Subhas Kumar Ghosh

Database , 8 Query Optimization

Ali Usman

Overview of query evaluation

avniS

Guacamole Fiesta: What do avocados and databases have in common?

ArangoDB Database

Unit 3 lecture-2

vishal choudhary

Is there a perfect data-parallel programming language? (Experiments with More...

Julian Hyde

Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl

Citus Data

Basic Tutorial of Association Mapping by Avjinder Kaler

Avjinder (Avi) Kaler

2. R-basics, Vectors, Arrays, Matrices, Factors

krishna singh

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...

CloudxLab

Как приготовить тестовые данные для Big Data проекта. Пример из практики

SQALab

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Spark Summit

Map reduce: beyond word count

Jeff Patti

What's hot (20)

Scaling up data science applications

Semi join

Hive query optimization infinity

SQL Joins and Query Optimization

CS 542 -- Query Optimization

A Graph-Based Method For Cross-Entity Threat Detection

ETL and pivoting in spark

Hadoop map reduce concepts

Database , 8 Query Optimization

Overview of query evaluation

Guacamole Fiesta: What do avocados and databases have in common?

Unit 3 lecture-2

Is there a perfect data-parallel programming language? (Experiments with More...

Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl

Basic Tutorial of Association Mapping by Avjinder Kaler

2. R-basics, Vectors, Arrays, Matrices, Factors

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...

Как приготовить тестовые данные для Big Data проекта. Пример из практики

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Map reduce: beyond word count

Similar to Berlin buzzwords 2018

Big Data Analytics with Apache Spark

MarcoYuriFujiiMelo

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...

Databricks

Tuning and Debugging in Apache Spark

Databricks

Learning spark ch04 - Working with Key/Value Pairs

phanleson

Tuning and Debugging in Apache Spark

Patrick Wendell

Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.

04 spark-pair rdd-rdd-persistence

Venkat Datla

This document provides an overview of Spark Pair RDDs and persistence. It defines Pair RDDs as RDDs of key-value pairs that enable aggregations. It describes how to create Pair RDDs in Scala, Python, and Java and covers common Pair RDD transformations like groupByKey, reduceByKey, aggregateByKey, combineByKey, joins, sorting by key, and actions. It also discusses the differences between groupByKey and reduceByKey and demonstrates various Pair RDD functions and persistence levels.

Databricks spark-knowledge-base-1

Rahul Kumar

This document covers best practices, troubleshooting, performance optimization, and Spark Streaming topics for Spark. It includes recommendations to avoid GroupByKey, not copy all elements of a large RDD to the driver, and gracefully deal with bad input data. Common errors like "Task not serializable" and missing dependencies are discussed. It also provides tips on checking the number of RDD partitions and improving data locality for performance.

From Query Plan to Query Performance: Supercharging your Apache Spark Queries...

Databricks

The SQL tab in the Spark UI provides a lot of information for analysing your spark queries, ranging from the query plan, to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common spark operations, such as scans, projects, filter, aggregations and joins; and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.

Apache spark - Architecture , Overview & libraries

Walaa Hamdy Assy

This document provides an overview of Apache Spark, an open-source unified analytics engine for large-scale data processing. It discusses Spark's core APIs including RDDs and transformations/actions. It also covers Spark SQL, Spark Streaming, MLlib, and GraphX. Spark provides a fast and general engine for big data processing, with explicit operations for streaming, SQL, machine learning, and graph processing. The document includes installation instructions and examples of using various Spark components.

Optimizations in Spark; RDD, DataFrame

Knoldus Inc.

Developing Apache Spark Jobs is the easier part of the process but the difficult portion comes in while executing them under full load as each job is unique when it comes to performance. Spark programs often face bottlenecks in terms of CPU, network bandwidth, memory usage which stems from Spark’s basic nature of in-memory computations. In this webinar, we will deal with the problem of how optimally you can perform your job operations in Apache Spark. We will address common performance problems including - ~ Inadequate transformations when working with RDD API as optimization is the developer’s responsibility, unlike in SQL querying language. ~ Proper partitioning of data so that Spark can perform tasks optimally ~ Why DataFrames have better performance than RDD? Here’s the agenda of the webinar - ~ Spark Execution Model ~ Optimizing Shuffle Operations ~ Optimizing Functions ~ SQL VS RDD ~ Logical & Physical Plan ~ Optimizing Joins

Apache Spark: What? Why? When?

Massimo Schenone

Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.

Apache Spark and DataStax Enablement

Vincent Poncet

Beyond shuffling - Strata London 2016

Holden Karau

Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production. Topics include: Working with key/value data Replacing groupByKey for awesomeness Key skew: your data probably has it and how to survive Effective caching and checkpointing Considerations for noisy clusters Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development How to make our code testable

Quick Guide to Refresh Spark skills

Ravindra kumar

AI與大數據數據處理 Spark實戰(20171216)

Paul Chao

This document summarizes machine learning concepts in Spark. It introduces Spark, its components including SparkContext, Resilient Distributed Datasets (RDDs), and common transformations and actions. Transformations like map, filter, join, and groupByKey are covered. Actions like collect, count, reduce are also discussed. A word count example in Spark using transformations and actions is provided to illustrate how to analyze text data in Spark.

Bigdata processing with Spark - part II

Arjen de Vries

This document provides a summary of Spark RDDs and the Spark execution model: - RDDs (Resilient Distributed Datasets) are Spark's fundamental data structure, representing an immutable distributed collection of objects that can be operated on in parallel. RDDs track lineage to support fault tolerance and optimization. - Spark uses a logical plan built from transformations on RDDs, which is then optimized and scheduled into physical stages and tasks by the Spark scheduler. Tasks operate on partitions of RDDs in a data-parallel manner. - The scheduler pipelines transformations where possible, truncates redundant work, and leverages caching and data locality to improve performance. It splits the graph into stages separated by shuffle operations or parent RDD boundaries

Introduction to Spark - DataFactZ

DataFactZ

We are a company driven by inquisitive data scientists, having developed a pragmatic and interdisciplinary approach, which has evolved over the decades working with over 100 clients across multiple industries. Combining several Data Science techniques from statistics, machine learning, deep learning, decision science, cognitive science, and business intelligence, with our ecosystem of technology platforms, we have produced unprecedented solutions. Welcome to the Data Science Analytics team that can do it all, from architecture to algorithms. Our practice delivers data driven solutions, including Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. We employ a number of technologies in the area of Big Data and Advanced Analytics such as DataStax (Cassandra), Databricks (Spark), Cloudera, Hortonworks, MapR, R, SAS, Matlab, SPSS and Advanced Data Visualizations. This presentation is designed for Spark Enthusiasts to get started and details of the course are below. 1. Introduction to Apache Spark 2. Functional Programming + Scala 3. Spark Core 4. Spark SQL + Parquet 5. Advanced Libraries 6. Tips & Tricks 7. Where do I go from here?

Drizzles Approach To Improving Performance Of The Server

PerconaPerformance

The document discusses Drizzle's approach to database performance. It values open discussion, a focus on interfaces over implementations, avoiding "magic" code, and using standard libraries. It also discusses cleaning up Drizzle's codebase to make it easier for developers to contribute. The document then provides an example of complex code that "makes baby kittens cry" and could benefit from refactoring.

Apache Spark - San Diego Big Data Meetup Jan 14th 2015

cdmaxime

This document provides an introduction to Apache Spark presented by Maxime Dumas of Cloudera. It discusses: 1. What Cloudera does including distributing Hadoop components with enterprise tooling and support. 2. An overview of the Apache Hadoop ecosystem including why Hadoop is used for scalability, efficiency, and flexibility with large amounts of data. 3. An introduction to Apache Spark which improves on MapReduce by being faster, easier to use, and supporting more types of applications such as machine learning and graph processing. Spark can be 100x faster than MapReduce for certain applications.

Koalas: Making an Easy Transition from Pandas to Apache Spark

Databricks

In this talk, we present Koalas, a new open-source project that aims at bridging the gap between the big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python. Pandas is the standard tool for data science in python, and it is typically the first step to explore and manipulate a data set by data scientists. The problem is that pandas does not scale well to big data. It was designed for small data sets that a single machine could handle. When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes. Through live demonstrations and code samples, you will understand: – how to effectively leverage both pandas and Spark inside the same code base – how to leverage powerful pandas concepts such as lightweight indexing with Spark – technical considerations for unifying the different behaviors of Spark and pandas

Similar to Berlin buzzwords 2018 (20)

Big Data Analytics with Apache Spark

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...

Tuning and Debugging in Apache Spark

Learning spark ch04 - Working with Key/Value Pairs

Tuning and Debugging in Apache Spark

04 spark-pair rdd-rdd-persistence

Databricks spark-knowledge-base-1

From Query Plan to Query Performance: Supercharging your Apache Spark Queries...

Apache spark - Architecture , Overview & libraries

Optimizations in Spark; RDD, DataFrame

Apache Spark: What? Why? When?

Apache Spark and DataStax Enablement

Beyond shuffling - Strata London 2016

Quick Guide to Refresh Spark skills

AI與大數據數據處理 Spark實戰(20171216)

Bigdata processing with Spark - part II

Introduction to Spark - DataFactZ

Drizzles Approach To Improving Performance Of The Server

Apache Spark - San Diego Big Data Meetup Jan 14th 2015

Koalas: Making an Easy Transition from Pandas to Apache Spark

Recently uploaded

Energy consumption of Database Management - Florina Jonuzi

Green Software Development

Need for Speed: Removing speed bumps from your Symfony projects ⚡️

Łukasz Chruściel

No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception. In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed. We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.

Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris

Neo4j

Artificia Intellicence and XPath Extension Functions

Octavian Nadolu

Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris

Neo4j

Unveiling the Advantages of Agile Software Development.pdf

brainerhub1

What is Augmented Reality Image Tracking

pavan998932

Atelier - Innover avec l’IA Générative et les graphes de connaissances

Neo4j

Atelier - Innover avec l’IA Générative et les graphes de connaissances Allez au-delà du battage médiatique autour de l’IA et découvrez des techniques pratiques pour utiliser l’IA de manière responsable à travers les données de votre organisation. Explorez comment utiliser les graphes de connaissances pour augmenter la précision, la transparence et la capacité d’explication dans les systèmes d’IA générative. Vous partirez avec une expérience pratique combinant les relations entre les données et les LLM pour apporter du contexte spécifique à votre domaine et améliorer votre raisonnement. Amenez votre ordinateur portable et nous vous guiderons sur la mise en place de votre propre pile d’IA générative, en vous fournissant des exemples pratiques et codés pour démarrer en quelques minutes.

ALGIT - Assembly Line for Green IT - Numbers, Data, Facts

Green Software Development

GreenCode-A-VSCode-Plugin--Dario-Jurisic

Green Software Development

socradar-q1-2024-aviation-industry-report.pdf

SOCRadar

SOCRadar's Aviation Industry Q1 Incident Report is out now! The aviation industry has always been a prime target for cybercriminals due to its critical infrastructure and high stakes. In the first quarter of 2024, the sector faced an alarming surge in cybersecurity threats, revealing its vulnerabilities and the relentless sophistication of cyber attackers. SOCRadar’s Aviation Industry, Quarterly Incident Report, provides an in-depth analysis of these threats, detected and examined through our extensive monitoring of hacker forums, Telegram channels, and dark web platforms.

LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx

lorraineandreiamcidl

WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.

8 Best Automated Android App Testing Tool and Framework in 2024.pdf

kalichargn70th171

How to write a program in any programming language

Rakesh Kumar R

Measures in SQL (SIGMOD 2024, Santiago, Chile)

Julian Hyde

SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries. SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL. To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context. A talk at SIGMOD, June 9–15, 2024, Santiago, Chile Authors: Julian Hyde (Google) and John Fremlin (Google) https://doi.org/10.1145/3626246.3653374

A Study of Variable-Role-based Feature Enrichment in Neural Models of Code

Aftab Hussain

Understanding variable roles in code has been found to be helpful by students in learning programming -- could variable roles help deep neural models in performing coding tasks? We do an exploratory study. - These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia

2024 eCommerceDays Toulouse - Sylius 2.0.pdf

Łukasz Chruściel

UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions

Peter Muessig

The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.

E-commerce Application Development Company.pdf

Hornet Dynamics

What is Master Data Management by PiLog Group

aymanquadri279

Recently uploaded (20)

Energy consumption of Database Management - Florina Jonuzi

Need for Speed: Removing speed bumps from your Symfony projects ⚡️

Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris

Artificia Intellicence and XPath Extension Functions

Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris

Unveiling the Advantages of Agile Software Development.pdf

What is Augmented Reality Image Tracking

Atelier - Innover avec l’IA Générative et les graphes de connaissances

ALGIT - Assembly Line for Green IT - Numbers, Data, Facts

GreenCode-A-VSCode-Plugin--Dario-Jurisic

socradar-q1-2024-aviation-industry-report.pdf

LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx

8 Best Automated Android App Testing Tool and Framework in 2024.pdf

How to write a program in any programming language

Measures in SQL (SIGMOD 2024, Santiago, Chile)

A Study of Variable-Role-based Feature Enrichment in Neural Models of Code

2024 eCommerceDays Toulouse - Sylius 2.0.pdf

UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions

E-commerce Application Development Company.pdf

What is Master Data Management by PiLog Group

Berlin buzzwords 2018

2. When DataFrames fail, resort to mapPartitions aka “What to do when Spark can’t handle it” Berlin Buzzwords, June 2018 Matija Gobec matija.gobec@smartcat.io @mad_max0204

3. Why this talk?

4. Complex problem with an easy solution (or not)

5. 2k+ lines of resulting dataframe tree schema

6. Logically correct vs Correct

7. Logically correct solution ● Data analysis process (data exploration) ● Get to the correct output ASAP ● Validate result ● Run on subset (if possible) ○ Quick iterations ○ Tends to eliminate data issues

8. Correct solution ● Correct result ● Repeatable performance ● Scalable and optimized ● Proper data handling

9. Let’s dig in

10. spark.read.jdbc What in the world? Specify columnName for proper distribution numPartitions defines parallelism Query for bounds (upper and lower) val df = (spark.read.jdbc(url=jdbcUrl, table="some_table", columnName="some_column", lowerBound=1L, upperBound=100000L, numPartitions=100, connectionProperties=connectionProperties))

11. (Re)partitioning df.repartition(100) - repartition to a number of partitions df.repartition(100, col("some_column")) - repartition and split on column df.coalesce(10) - coalesce (narrow dependency)

12. (Re)partitioning Partition by a join field Repartition is expensive but can help down the road coalesce vs repartition HashPartitioner Number of partitions divisible by available cores (usually from 2 to 10 x cores)

13. Data locality PROCESS_LOCAL - data is in the same JVM NODE_LOCAL - data is on the same node (HDFS, local executor) RACK_LOCAL, ANY - data is not on the executors NO_PREF - no locality preference

14. Data locality

15. Work distribution

16. Work distribution

17. Broadcasting variables Keeps a copy of data on each worker Reduces the size of a serialized task as data is on the worker already Optimizes join operations

18. Checkpointing Truncate lineage graph Reliable - HDFS storage (sc.setCheckpointDir) Local - node local temp storage

19. Metrics Spark WebUI is your friend Use Spark History Server JSON metrics over REST API Collect JMX metrics (Dropwizard)

20. Event timeline

21. Storage tab

22. RDD storage details

23. But what if all fails?

24. What is Apache Spark?

25. Let’s use it Spark is a distributed computing framework Spark provides resource management Spark provides controllable work distribution Spark is highly configurable Why not?

26. MapPartitions function /** * Return a new RDD by applying a function to each partition of this RDD. * * `preservesPartitioning` indicates whether the input function preserves the partitioner, which * should be `false` unless this is a pair RDD and the input function doesn't modify the keys. */ def mapPartitions[U: ClassTag]( f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

27. MapPartitions function - indexed /** * Return a new RDD by applying a function to each partition of this RDD, while tracking the index * of the original partition. * * `preservesPartitioning` indicates whether the input function preserves the partitioner, which * should be `false` unless this is a pair RDD and the input function doesn't modify the keys. */ def mapPartitionsWithIndex[U: ClassTag]( f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

28. MapPartitions example def mapFunction(args): Iterator[Row] => Iterator[Row] = iterator => { // Processing based on the partition data // Example: // val ids = iterator.toList.map(row => row.getInt(0)) // Processor.doWork(ids) iterator }

29. Other usages Applying custom processing code Running ML pipeline Using third party libraries Misc

30. How about them numbers?

31. Results Processing time dropped from ~96 hours to ~3.5hours (28 times) Adding new entities doesn’t significantly impact complexity/performance Additional control which we leveraged down the road We managed to use all the resources We managed to use all the resources Raised code complexity PROs CONs

32. Q&A

33. Matija Gobec matija.gobec@smartcat.io @mad_max0204 Thank you

Berlin buzzwords 2018

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Berlin buzzwords 2018

Similar to Berlin buzzwords 2018 (20)

Recently uploaded

Recently uploaded (20)

Berlin buzzwords 2018