This document discusses best practices for migrating Spark applications from version 1.x to 2.0. It covers new features in Spark 2.0 like the Dataset API, catalog API, subqueries and checkpointing for iterative algorithms. The document recommends changes to existing best practices around choice of serializer, cache format, use of broadcast variables and choice of cluster manager. It also discusses how Spark 2.0's improved SQL support impacts use of HiveContext.
1. Migrating to Spark 2.0 - Part 2
Moving to next generation Spark
https://github.com/phatak-dev/spark-two-migration
2. ● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
3. Agenda
● What’s New in Spark 2.0
● Recap of Part 1
● Sub Queries
● Catalog API
● Hive Catalog
● Refresh Table
● Checkpoint for Iteration
● References
4. What’s new in 2.0?
● Dataset is the new single user-facing abstraction
● The RDD abstraction is used only at runtime
● Higher performance with whole-stage code generation
● Significant changes to the streaming abstraction with Spark Structured Streaming
● Incorporates learnings from 4 years of production use
● Spark ML is replacing MLlib as the de facto ML library
● Breaks API compatibility for better APIs and features
5. Need for Migration
● A lot of real-world code is written in the 1.x series of Spark
● As the fundamental abstractions have changed, all this code needs to migrate to make use of the performance and API benefits
● More and more ecosystem projects require Spark 2.0
● The 1.x series will be out of maintenance mode very soon, so no more bug fixes
6. Recap of Part 1
● Choosing Scala Version
● New Connectors
● SparkSession Entry Point
● Built-in CSV Connector
● Moving from DF/RDD API to Dataset
● Cross Joins
● Custom ML Transforms
8. SubQueries
● A query inside another query is known as a subquery
● Standard feature of SQL
● Example from MySQL:
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● Highly useful as they allow us to combine multiple different types of aggregation in one query
9. Types of SubQuery
● In the select clause (Scalar)
SELECT employee_id,
age,
(SELECT MAX(age) FROM employee) max_age
FROM employee
● In from clause (Derived Tables)
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● In the where clause (Predicate)
SELECT * FROM t1 WHERE column1 = (SELECT column1 FROM t2);
10. SubQuery support in Spark 1.x
● SubQuery support in Spark 1.x mimics the support available in Hive 0.12
● Hive only supported subqueries in the from clause, so Spark only supported the same
● Subqueries in the from clause are fairly limited in what they can do
● To support advanced querying in Spark SQL, the other kinds of subqueries needed to be added in 2.0
11. SubQuery support in Spark 2.x
● Spark has greatly improved its SQL dialect support in the 2.0 version
● Most of the standard features of the SQL-92 standard have been added
● Full-fledged SQL parser, no longer dependent on Hive
● Runs all 99 TPC-DS queries natively
● Makes Spark a full-fledged OLAP query engine
12. Scalar SubQueries
● Scalar subqueries are subqueries which return a single (scalar) result
● There are two kinds of scalar subqueries
○ Uncorrelated subqueries
The ones which don’t depend on the outer query
○ Correlated subqueries
The ones which depend on the outer query
13. Uncorrelated Scalar SubQueries
● Add the maximum sales amount to each row of the sales data
● This normally helps us understand how far a given transaction is from the maximum sale we have made
● In Spark 1.x
sparkone.SubQueries
● In Spark 2.x
sparktwo.SubQueries (see the sketch below)
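A minimal sketch of an uncorrelated scalar subquery in Spark 2.x SQL. The sales schema (itemId, amount) and the session setup are assumptions for illustration; the actual code lives in sparktwo.SubQueries.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("subquery-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sales data with an itemId and an amount column
val sales = Seq((1, 100.0), (2, 250.0), (1, 75.0)).toDF("itemId", "amount")
sales.createOrReplaceTempView("sales")

// Uncorrelated: the inner query does not reference the outer row
spark.sql("""
  SELECT itemId, amount,
         (SELECT MAX(amount) FROM sales) AS max_amount
  FROM sales
""").show()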
14. Correlated SubQueries
● Add the maximum sales amount to each row of the sales data within each item category
● This normally helps us understand how far a given transaction is from the maximum sale we have made in that category
● In Spark 1.x
sparkone.SubQueries
● In Spark 2.x
sparktwo.SubQueries (see the sketch below)
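A minimal sketch of a correlated scalar subquery in Spark 2.x SQL, reusing the hypothetical sales view from the previous sketch but assuming it also has a category column; here the inner query references the outer row.

// Assumed view: sales(category, itemId, amount)
spark.sql("""
  SELECT s1.category, s1.itemId, s1.amount,
         (SELECT MAX(s2.amount)
          FROM sales s2
          WHERE s2.category = s1.category) AS max_in_category
  FROM sales s1
""").show()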
16. Catalog in SQL
● A catalog is a metadata store which contains the metadata of all the entities of a SQL system
● Typical contents of a catalog are:
○ Databases
○ Tables
○ Table Metadata
○ Functions
○ Partitions
○ Buckets
17. Catalog in Spark 1.x
● By default, Spark uses an in-memory catalog which keeps track of Spark temp tables
● It is not persistent
● For any persistent operations, Spark advocated use of the Hive metastore
● There was no standard API to query metadata information from the in-memory / Hive metastore
● Ad hoc functions were added to SQLContext over time to fix this
18. Need for a Catalog API
● Many interactive applications, like notebook systems, often need an API to query the metastore to show relevant information to the user
● Whenever we integrate with Hive, without a catalog API we have to resort to running HQL queries and parsing their output to get the information
● Cannot manipulate the Hive metastore directly from the Spark API
● Needs to evolve to support more metastores in the future
19. Catalog API in 2.x
● Spark 2.0 has added a full-fledged catalog API to SparkSession
● It lives in the sparkSession.catalog namespace
● This catalog has APIs to create, read and delete elements from the in-memory catalog and also from the Hive metastore
● Having this standard API to interact with the catalog makes a developer’s life much easier than before (see the sketch below)
● If we were using non-standard APIs before, it’s time to migrate
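A minimal sketch of the sparkSession.catalog API; the database and table names are assumptions for illustration.

// spark is a SparkSession
spark.catalog.listDatabases().show()        // databases known to the catalog
spark.catalog.listTables("default").show()  // tables in the 'default' database
spark.catalog.listFunctions().show()        // registered functions
spark.catalog.listColumns("default", "sales").show()  // columns of an assumed 'sales' table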
20. Catalog API Migration
● Migrate from the sqlContext APIs to the sparkSession.catalog API
● Use SparkSession rather than HiveContext to get access to these operations
● Spark 1.x
sparkone.CatalogExample
● Spark 2.x
sparktwo.CatalogExample
22. Hive Integration in Spark 1.x
● Spark SQL has had native support for Hive from the beginning
● In the beginning, Spark SQL used the Hive query parser for parsing and the Hive metastore for persistent storage
● To integrate with Hive, one had to create a HiveContext, which is separate from SQLContext
● Some APIs were only available on SQLContext and some Hive-specific ones on HiveContext
● No support for manipulating the Hive metastore
23. Hive Integration in 2.x
● No more separate HiveContext
● SparkSession has an enableHiveSupport API to enable Hive support
● This makes both the Spark SQL and Hive APIs consistent
● The Spark 2.0 catalog API also supports the Hive metastore
● Example
sparktwo.CatalogHiveExample (see the sketch below)
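A minimal sketch of enabling Hive support on a single SparkSession, assuming a local warehouse path for illustration; the full example is in sparktwo.CatalogHiveExample.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-integration")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // assumed path
  .enableHiveSupport()
  .getOrCreate()

// The same session serves Spark SQL, Hive tables and the catalog API
spark.sql("SHOW TABLES").show()
spark.catalog.listTables().show()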
25. Need for the Refresh Table API
● In Spark, we cache a table (dataset) for performance reasons
● Spark caches the metadata in its metadata store and the actual data in the block manager
● If the underlying file/table changes, there was no direct API in Spark to force a table refresh
● If you just uncache/recache, it only reflects the change in the data, not the metadata
● So we need a standard way to refresh the table
26. Refresh Table and By Path
● Spark 2.0 provides two APIs for refreshing datasets in Spark
● The refreshTable API, which was imported from HiveContext, is used for registered temp tables or Hive tables
● The refreshByPath API is used for refreshing datasets without having to register them as tables beforehand
● Spark 1.x
sparkone.RefreshExample
● Spark 2.x - sparktwo.RefreshExample (see the sketch below)
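A minimal sketch of the two refresh APIs; the table name and path are assumptions for illustration.

// spark is a SparkSession
spark.catalog.refreshTable("sales")                 // refresh a registered temp or Hive table
spark.catalog.refreshByPath("/data/sales.parquet")  // refresh any dataset read from this path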
28. Iterative Programming in Spark
● Spark is one of the first big data frameworks to have great native support for iterative programming
● Iterative programs go over the data again and again to compute their results
● Spark ML is one of the iterative frameworks in Spark
● Even though caching and the RDD mechanisms worked great for iterative programming, moving to DataFrames has created new challenges
30. Iteration in the DataFrame API
● As every step of the iteration creates a new DF, the logical plan keeps growing
● As Spark needs to keep the complete query plan for recovery, the overhead of analysing the plan increases with the number of iterations
● This overhead is compute bound and happens on the driver
● As this overhead grows, it makes iteration very slow
● Ex : sparkone.CheckPoint
31. Solution to the Query Plan Issue
● To solve the ever-growing query plan (lineage), we need to truncate it to make iteration faster
● Whenever we truncate the query plan, we lose the ability to recover
● To avoid that, we need to store the intermediate data before we truncate the query plan
● Saving intermediate data along with truncating the query plan results in faster performance
32. Dataset Checkpoint API
● In Spark 2.1, there is a new checkpoint API on Dataset
● It’s analogous to the RDD checkpoint API
● For an RDD, checkpoint saves the RDD data and then truncates its lineage
● Similarly, in the case of a Dataset, checkpoint persists the dataset and then truncates its query plan
● Make sure that the checkpoint times are much lower than the overhead you are facing
● Ex : sparktwo.CheckPoint (see the sketch below)
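A minimal sketch of truncating the query plan during iteration with Dataset.checkpoint (available from Spark 2.1). The checkpoint directory, column names and update rule are assumptions for illustration; the actual code is in sparktwo.CheckPoint.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().appName("checkpoint-iteration").master("local[*]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // assumed path

var ranks = spark.range(0, 1000).toDF("id").withColumn("rank", lit(1.0))

for (i <- 1 to 50) {
  ranks = ranks.withColumn("rank", col("rank") * 0.85 + 0.15)
  if (i % 10 == 0) {
    // Saves the intermediate data and truncates the logical plan,
    // keeping plan analysis on the driver cheap
    ranks = ranks.checkpoint()
  }
}
ranks.show()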
34. Best Practice Migration
● As the fundamental abstractions have changed in Spark 2.0, we need to rethink our best practices
● Many best practices were centered around the RDD abstraction, which is no longer the central abstraction
● Also, with the many optimisations in Catalyst, a lot of the optimisation is now done for us by the platform
● So let’s look at some best practices of 1.x and see how they change
35. Choice of Serializer
● Use the Kryo serializer over Java and register classes with Kryo
● This best practice was devised for efficient caching and transfer of RDD data
● But in Spark 2.0, Dataset uses a custom, code-generated serialization framework for most of the code and data
● So unless there is heavy use of RDDs in your project, you don’t need to worry about the serializer in 2.0 (see the sketch below)
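A minimal sketch of the 1.x-era Kryo configuration, only worth keeping if the job still relies heavily on RDDs; com.example.MyCaseClass is a hypothetical user class.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kryo-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.classesToRegister", "com.example.MyCaseClass")  // hypothetical class
  .getOrCreate()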
36. Cache Format
● RDD uses MEMORY_ONLY as the default, and it’s the most efficient caching for RDDs
● DataFrame/Dataset uses MEMORY_AND_DISK rather than MEMORY_ONLY
● Recomputing a Dataset and converting it to the custom serialization format is often costly
● So use MEMORY_AND_DISK as the default format over MEMORY_ONLY (see the sketch below)
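A minimal sketch; 'sales' is a hypothetical Dataset. Dataset.cache() already defaults to MEMORY_AND_DISK in 2.x, and the level can also be set explicitly.

import org.apache.spark.storage.StorageLevel

sales.persist(StorageLevel.MEMORY_AND_DISK)
sales.count()  // action to materialise the cache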
37. Use of Broadcast Variables
● Use broadcast variables for optimising lookups and joins
● Broadcast variables played an important role in making joins efficient in the RDD world
● These variables don’t have much scope in Dataset API land
● By configuring the broadcast threshold, Spark SQL will do the broadcasting automatically (see the sketch below)
● Don’t use them unless there is a reason
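A minimal sketch of letting Spark SQL handle broadcasting; 'sales' and 'items' are hypothetical Datasets and the threshold value is an assumption.

import org.apache.spark.sql.functions.broadcast

// Tables smaller than this size (in bytes) are broadcast automatically in joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// Or hint a specific join explicitly instead of hand-rolling a broadcast variable
val joined = sales.join(broadcast(items), "itemId")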
38. Choice of Cluster Manager
● Use YARN/Mesos for production. Standalone is mostly for simple apps
● Users were encouraged to use a dedicated cluster manager over the standalone manager that ships with Spark
● With Databricks Cloud putting its weight behind the standalone cluster manager, it has become production grade
● Many companies run their Spark applications on standalone clusters today
● Choose standalone if you run only Spark applications
39. Use of HiveContext
● Use HiveContext over SQLContext for using Spark SQL
● In Spark 1.x, Spark SQL was simplistic and depended heavily on Hive for query parsing
● In Spark 2.0, Spark SQL has been enriched and is now more powerful than Hive itself
● Most of the Hive UDFs have now been rewritten in Spark SQL and are code generated
● Unless you want to use the Hive metastore, use SparkSession without Hive support (see the sketch below)
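A minimal sketch: a plain SparkSession, without enableHiveSupport, is enough for Spark SQL in 2.x unless the Hive metastore is needed.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-only")
  .master("local[*]")
  .getOrCreate()

spark.sql("SELECT 1 + 1 AS two").show()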