Presentation from the Boulder/Denver Big Data Meetup on 2/20/2020 in Boulder, CO. Topics covered: troubleshooting Spark jobs (groupBy, shuffle) for big data, tuning AWS EMR Spark clusters, EMR cluster resource utilization, and writing scalable Scala for scanning S3 metadata.
[4DEV][Łódź] Ivan Vaskevych - InfluxDB and Grafana fighting together with IoT...PROIDEA
They promise that IoT (Internet of Things) will conquer the world. But what will tackle billions of bytes that flow into our servers every hour?
First released in 2013, InfluxDB is used by eBay, Cisco, IBM, and other big companies. It's production-proven time-series storage.
During this talk we're going to get acquainted with it and see how InfluxDB can help to solve your problems.
We’ll see how to quickly install it on Amazon Web Services platform and how it scales.
And for the dessert, we’re going to draw pretty Grafana graphs using InfluxDB data.
Paul Dix (Founder of InfluxDB) - Organising Metrics at #DOXLON (Outlyer)
Video:
Paul Dix (Founder of InfluxDB) talking about his awesome Open-Source projects for monitoring.
For more info visit: InfluxDB: www.influxdb.com
Join DevOps Exchange London here: http://www.meetup.com/DevOps-Exchange-London/
Follow DOXLON on twitter: twitter.com/doxlon
Biblia Hebraica Stuttgartensia Amstelodamensis. Coding the Hebrew Bible with an Open Science ethos: Text-Fabric.
Text-Fabric is several things: (1) a browser for ancient text corpora; (2) a Python3 package for processing ancient corpora
A corpus of ancient texts and linguistic annotations represents a large body of knowledge. Text-Fabric makes that knowledge accessible to non-programmers by means of a built-in search interface that runs in your browser.
From there, the step to programming your own analytics is not so big anymore, because you can call the Text-Fabric API from your Python programs, and it works really well in Jupyter notebooks.
Ali Asad Lotia (DevOps at Beamly) - Riemann Stream Processing at #DOXLONOutlyer
Video: http://youtu.be/a1r2bpGQbBQ
A talk about using Riemann for stream processing, currently the on-premises tool of choice for metrics aficionados. Ali will talk about how they are using it to process the metrics coming out of their cloud service.
For more info see : http://riemann.io
Join DevOps Exchange London here: http://www.meetup.com/DevOps-Exchange-London/
Follow DOXLON on twitter: twitter.com/doxlon
First impressions of SparkR: our own machine learning algorithm (InfoFarm)
In June 2015, SparkR was first integrated into Spark. At InfoFarm we strive to stay on top of new technologies, hence we have tried it out and implemented a few machine learning algorithms as well.
How LogicMonitor manages resources in AWS using Terraform to provide a reliable, repeatable way to both naturally grow our infrastructure and provide disaster recovery solutions.
Strata NYC 2015 - Supercharging R with Apache Spark (Databricks)
R is the favorite language of many data scientists. In addition to being a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large or distributed data with R is challenging, so most data scientists use R along with other frameworks and languages. In this mode, most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported into R as native data structures. In this talk we show an alternative, and complementary, approach to SparkR for integrating Spark and R.
Since SparkR was released in version 1.4 of Apache Spark, distributed data remains inside the JVM instead of in individual R processes running on the workers. This approach is more convenient when dealing with external data sources such as Cassandra, Hive, and Spark's own distributed DataFrames. We show two specific techniques to remove the data transfer friction between R and the JVM: collecting Spark DataFrames as R data frames, and user-space filesystems. We think this model complements and improves the day-to-day workload of many data scientists who use R. Spark's interactive query processing, especially with in-memory datasets, closely matches the R interactive session model. When integrated together, Spark and R can provide state-of-the-art tools for the entire end-to-end data science pipeline. We will show how such a pipeline works in real-world use cases in a live demo at the end of the talk.
Big Data Beyond the JVM - Strata San Jose 2018 (Holden Karau)
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Spark Summit EU 2015: Lessons from 300+ production users (Databricks)
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's... (Glenn K. Lockwood)
Comparing the burst buffers of today, such as the Cray DataWarp-based burst buffer implemented on NERSC Cori, to the proto-burst buffer deployed on SDSC's Gordon supercomputer in 2012.
Operating and Supporting Delta Lake in Production (Databricks)
Delta Lake is widely adopted, but there are things to be aware of when dealing with petabytes of data in Delta Lake. Smart decisions here give the best efficiency and increase the adoption of Delta. Best practices like OPTIMIZE and ZORDER have to be chosen wisely. We have support stories where we successfully resolved performance issues by applying the right performance strategy. There is a set of common issues and repeated questions that our strategic customers face when using Delta, and in this session we cover them and how to address them.
Building Apache Cassandra clusters for massive scale (Alex Thompson)
Covering theory and operational aspects of bringing up Apache Cassandra clusters, this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa... (Databricks)
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold covers the underlying techniques used to achieve high-performance sorting with Spark and Scala, including sun.misc.Unsafe, exploiting cache locality, and high-level resource pipelining.
Headaches and Breakthroughs in Building Continuous Applications (Databricks)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap... (Landon Robinson)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
An over-ambitious introduction to Spark programming, testing, and deployment. This slide deck tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to prevent transparent background colors from rendering properly; this has been fixed in a recent upload.
Breakthrough OLAP performance with Cassandra and Spark (Evan Chan)
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
A comprehensive introduction to the big data world in the AWS cloud: Hadoop, streaming, batch, Kinesis, DynamoDB, HBase, EMR, Athena, Hive, Spark, Pig, Impala, Oozie, Data Pipeline, security, cost, and best practices.
Introduction to data processing using Hadoop and Pig (Ricardo Varela)
In this talk we give an introduction to data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
Making the big data ecosystem work together with Python & Apache Arrow, Apach... (Holden Karau)
Slides from PyData London exploring how the big data ecosystem (currently) works together as well as how different parts of the ecosystem work with Python. Proof-of-concept examples are provided using nltk & spacy with Spark. Then we look to the future and how we can improve.
1. Spark Gotchas and Lessons Learned
Jen Waller, Ph.D.
Boulder/Denver Big Data Meetup
Feb 20, 2020
Boulder, CO
2. Overview
● Overall Dev Approach
● Useful Spark Built-Ins
● How to Fail at Scale
● Resource Utilization
3. “Strategery”
● Local machine; simulated cluster
● Spark-shell/spark-submit
● Tiny subset of data (even better: TDD w/ programmatically generated data!)
● Real cluster
● Start tiny: test functions/configs specific to the cloud
● Bigger cluster for load testing
● Spark-shell = handy for quick iteration on manual cluster configs, load testing one function at a time
5. What about notebooks?
By Sam Shere (1905–1982) - Zeppelin-ramp de Hindenburg / Hindenburg
zeppelin disaster, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=19329337
7. Spark UI & Spark History Server
● Can access anywhere (local, cloud)
● Jobs/tasks, execution plans, memory usage, configs
● Maximizing utility of metrics data
○ Set labels for task groups and jobs using sparkContext
○ Break jobs and tasks apart by repartitioning, even dumping to disk
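The labeling tip above can be sketched as follows; the group name and description are hypothetical labels, and a local-mode session stands in for a real cluster:

```scala
import org.apache.spark.sql.SparkSession

object JobLabelDemo {
  // Run a small job under a named group so it is easy to find in the Spark UI.
  def labeledSum(spark: SparkSession): Double = {
    val sc = spark.sparkContext
    // "nightly-etl" and the description are made-up labels; use ones that match your pipeline
    sc.setJobGroup("nightly-etl", "Aggregate daily events", interruptOnCancel = true)
    val total = sc.parallelize(1 to 100).sum()
    sc.clearJobGroup() // stop labeling subsequent jobs
    total
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("job-label-demo").getOrCreate()
    println(labeledSum(spark)) // 5050.0
    spark.stop()
  }
}
```

Any job submitted between `setJobGroup` and `clearJobGroup` shows up in the UI under that group, which makes the Jobs tab far easier to read for multi-stage pipelines.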
8. REST API & Metrics Sink(s)
● REST API
○ curl http://localhost:4040/api/v1/applications
● Can configure a set of sinks for:
○ Master, applications, worker, executor, driver, shuffleService, applicationMaster (YARN)
● And send metrics to:
○ Console, CSV file, JMX console, within the Spark UI as JSON, Graphite node, slf4j, StatsD node
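The sinks above are wired up in conf/metrics.properties. A minimal sketch using the CSV sink; the 10-second period and the output directory are arbitrary choices:

```properties
# conf/metrics.properties: route metrics from all instances to a CSV sink
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

# Ship the file to executors and point Spark at it, e.g.:
# spark-submit --files conf/metrics.properties \
#   --conf spark.metrics.conf=metrics.properties ...
```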
10. Don’t Overload Data Store APIs
Avoid full scans of all partitions (reading the whole table and filtering afterward still lists every partition on S3 first):
val df = spark
  .read
  .parquet("s3://mybucket/mydata")
  .filter(col("mycolumn").equalTo("someDate"))
You can still read in data as partitioned without scanning entire table:
val df = spark
  .read
  .option("basePath", "s3://mybucket/mydata")
  .parquet("s3://mybucket/mydata/someDate")
11. Use Built-In Optimizations for Reading Data
● Automatic detection of partitions and efficient data read
○ Provide the basePath when reading in partitions
○ Always provide a schema to prevent repeated schema checking
● Columnar data: Parquet/ORC reader
○ Projection pushdown = only read the columns you need
○ Predicate/filter pushdown = use metadata to read in only the rows you need
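The three bullets above combine into one reader, sketched here; the schema and column names are hypothetical, and the bucket path is the one from the previous slide:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object PartitionedRead {
  // Hypothetical schema; supplying it up front avoids repeated schema inference passes.
  val eventSchema: StructType = StructType(Seq(
    StructField("user_id", StringType),
    StructField("event_type", StringType),
    StructField("someDate", StringType) // partition column
  ))

  def readEvents(spark: SparkSession, basePath: String): DataFrame =
    spark.read
      .schema(eventSchema)            // skip schema inference/checking
      .option("basePath", basePath)   // keep partition columns when reading subdirectories
      .parquet(basePath)
      .select("user_id", "event_type")       // projection pushdown: only these columns are read
      .filter(col("event_type") === "click") // predicate pushdown via Parquet metadata
}
```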
12. Beware the Shuffle!
● GroupBy, Join, Distinct…
● Amazon suggests avoiding shuffles entirely.
● Do that! Find another way to aggregate your data (e.g., aggregate it upstream in Kafka/Kinesis/Flink, or index it in Elasticsearch; there are many good options)
13. If you must shuffle… Know your data.
● Check for repeated values and nulls on join columns
○ Joining data with repeated values on both sides → gigantic result
○ Joining columns with nulls → massive skew
■ Can "salt" nulls by pre-filling arbitrary values into empty cells
○ Cluster resource use could be throttling broadcast joins (check it!)
● Check for skew
○ Grouping by a skewed column → Spark naively assigns rows to executors based on the levels of the skewed column
■ Application == dead (out of memory, network timeouts, lost nodes, processes that never end)
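The checks above are cheap to automate before committing to a join; `joinKeyHealth` is a hypothetical helper name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object JoinChecks {
  // Count null keys and keys that appear more than once; both can blow up a join.
  def joinKeyHealth(df: DataFrame, key: String): (Long, Long) = {
    val nullKeys = df.filter(col(key).isNull).count()
    val repeatedKeys = df.groupBy(key).count().filter(col("count") > 1).count()
    (nullKeys, repeatedKeys)
  }
}
```

Run it on both sides of a planned join; a large repeated-key count on both sides warns of a near-Cartesian result, and a large null count warns of skew.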
14. Controlling Spark Shuffles
● Partition your data so it's mapped evenly across the cluster
○ Partition by a unique ID
○ Avoid partitioning on columns with a lot of null, missing, or skewed values
● Partition data to match the job you're running
○ Parallel transforms on many datasets: 200 partitions
○ Billions of pairwise comparisons: 4-10k partitions
○ Tests on a single server/locally: 1 partition
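The partition counts above can be sketched like this; `user_id` stands in for whatever unique ID your data has, and 200 is the example count from the slide, not a universal setting:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object Partitioning {
  // Spread rows evenly by a high-cardinality ID before a wide transform.
  def forParallelTransforms(df: DataFrame): DataFrame =
    df.repartition(200, col("user_id"))

  // Collapse to one partition for local/single-server tests.
  def forLocalTests(df: DataFrame): DataFrame =
    df.coalesce(1)
}
```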
15. Hacks for Shuffling Skewed Data
● Limit the job to a single level of the skewed variable at a time (serialize).
● Manually set a small broadcast blockSize to fit the instance types in your cluster.
● Salt the data.
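One common way to salt a skewed join, sketched under assumptions: `facts` is the large skewed side, `dims` the small side, and `key` the join column (all hypothetical names). Each hot key is split into `numSalts` sub-keys so its rows spread across executors:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object Salting {
  def saltedJoin(facts: DataFrame, dims: DataFrame, key: String, numSalts: Int): DataFrame = {
    // Big side: append a random salt in [0, numSalts) to the join key.
    val salted = facts.withColumn("salt", (rand() * numSalts).cast("int"))
    // Small side: replicate each row once per salt value so every sub-key finds a match.
    val replicated = dims.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))
    salted.join(replicated, Seq(key, "salt")).drop("salt")
  }
}
```

The result is row-for-row identical to the plain join; only the shuffle distribution changes. Replicating the small side numSalts times is the price paid, so this only pays off when one side is small.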
17. EMR Cluster Resource Gotchas
● By default, YARN assigns only one vCPU per executor.
● If maximizeResourceAllocation = true, you get only one executor on each node (i.e., one YARN container/executor per machine).
● Poor use of resources.
● Lack of parallelism = bad for things that benefit from parallelism, like broadcast joins.
18. This gets really messy if multiple applications are running on one cluster.
19. How to get Spark to use more than one vCPU per machine?
Manually change the memory allocated to executors/driver?
Nope.
20. How to get Spark to use more than one vCPU per machine?
Change the YARN configs!
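As a sketch of that YARN change: the capacity scheduler's default resource calculator considers only memory, so a commonly used fix on EMR is switching to the DominantResourceCalculator via a configuration classification so YARN actually schedules on cores. Verify the property name against your EMR release before relying on it:

```json
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
```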
21. Great! Except…
● Unless you manually set the number of cores used by the driver, it's 1.
● Which is fine, unless you switch to larger instance types…
● Then you should manually configure cluster resources.
Image credit: https://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
22. Summary
● Spark is awesome, but can be tricky.
● Read the docs! Use those helpful Spark built-ins.
● Avoid/manage shuffling.
● Use the Hadoop UI to check your resource utilization.