A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai (Databricks)
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
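Catalyst's core idea, rules that pattern-match and rewrite nodes of a query tree, can be sketched in plain Python. The class and rule names below are illustrative stand-ins, not Spark's actual API (Catalyst's real trees are Scala case classes manipulated via `transformUp`/`transformDown`):

```python
from dataclasses import dataclass

# A minimal expression tree, loosely modeled on Catalyst's TreeNode.
@dataclass
class Literal:
    value: int

@dataclass
class Add:
    left: object
    right: object

def transform(node, rule):
    """Apply `rule` bottom-up over the tree, in the spirit of transformUp."""
    if isinstance(node, Add):
        node = Add(transform(node.left, rule), transform(node.right, rule))
    return rule(node)

def constant_fold(node):
    """A rewrite rule: collapse Add(Literal, Literal) into a single Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node

tree = Add(Literal(1), Add(Literal(2), Literal(3)))
folded = transform(tree, constant_fold)
print(folded)  # Literal(value=6)
```

The optimizer is essentially a batch of such rules applied until the tree stops changing, which is what makes Catalyst easy to extend with custom rules.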
The Query Optimizer is the “brain” of your Postgres database. It interprets SQL queries and determines the fastest method of execution. Using the EXPLAIN command, this presentation shows how the optimizer interprets queries and determines optimal execution.
This presentation will give you a better understanding of how Postgres optimally executes its queries and what steps you can take to understand and perhaps improve its behavior in your environment.
To listen to the webinar recording, please visit EnterpriseDB.com > Resources > On-demand Webcasts
If you have any questions, please email sales@enterprisedb.com
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm that you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) Zstandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with the `memory` medium; it also maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of Zstandard’s buffer-pool feature, and its performance gain is impressive, too.
2) Event log compression is another area where compression saves storage cost on cloud storage like S3 and improves usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work underway to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark’s Avro file format in Spark 3.2.0.
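The use cases above map to a handful of Spark configuration keys; a sketch of the relevant properties follows (key names as of Spark 3.2; defaults and availability vary by release, so verify against your version's configuration docs):

```python
# Spark properties that switch the code paths above to Zstandard.
# (This is a reference sketch, not a complete tuning recipe.)
zstd_conf = {
    # 1) shuffle / local disk IO: governs shuffle file and spill compression
    "spark.io.compression.codec": "zstd",
    # 2) event log compression (SPARK-34503 switched the default codec to zstd)
    "spark.eventLog.compress": "true",
    "spark.eventLog.compression.codec": "zstd",
    # 3) data file compression for Parquet and ORC output
    "spark.sql.parquet.compression.codec": "zstd",
    "spark.sql.orc.compression.codec": "zstd",
}

for key, value in zstd_conf.items():
    print(f"{key}={value}")
```

These can be passed via `--conf` on `spark-submit` or set on the session builder.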
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo (Hosted by Confluent) | Current 2022
Seven years ago, O’Reilly published the Streaming 101 and Streaming 102 articles, which have long served as the gateway for many undertaking their own journey into the world of stream processing. But while those articles have remained remarkably relevant over the years, they have also remained entirely static. Meanwhile, those of us pushing the envelope of stream processing forward have not.
Join Tyler Akidau, original author of those articles, and Dan Sotolongo, longtime collaborator and expert in streaming theory, as they reframe the fundamentals of stream processing in modern terms. What is streaming? What theoretical changes do we need to grow beyond the limitations of stream processing systems of the past? What wisdom can we draw from other disciplines to inform these changes? And if we transcend these limitations, can we make life simpler for everyone?
This talk will cover the key concepts of stream processing theory as we understand them today. It is simultaneously an introductory talk as well as an advanced survey on the breadth of stream processing theory. Anyone with an interest in streaming should find something engaging within.
Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn’s Spark team. In this talk, we will present the Spark SQL ecosystem and roadmap at LinkedIn and introduce the highlighted projects we are working on, such as:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2d join algorithm to Spark SQL
* Fixing join skewness with adaptive execution
* Enhancing the cost-optimizer with a history-based learning approach
In our experience, many problems with production workflows can be traced back to unexpected values in the input data. In a complex pipeline, it can be difficult and costly to trace the root cause of errors. Here we outline our work developing an open source data validation framework built on Apache Spark. Our goal is a tool that easily integrates into existing workflows to automatically make data validation a vital initial step of every production workflow. Our tool is aimed at data scientists and data engineers, who are not necessarily Scala/Python programmers. Our users specify a configuration file that details the data validation checks to be completed. This configuration file is parsed into appropriate queries that are executed with Apache Spark. A status report is logged, which is used to notify developers/maintainers and to establish a historical record of validator checks. This work was inspired by the many great ideas behind Google's TensorFlow Extended (TFX) platform, in particular TensorFlow Data Validation (TFDV). As such we provide optional functionality for our users to visualize their data using Facets Overview and Facets Dive.
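The configuration-driven approach described above can be sketched without Spark at all. The check names and config shape below are hypothetical illustrations, not the framework's actual format:

```python
# A toy config-driven validator: each entry names a column, a check,
# and its parameters (the format here is illustrative only).
config = [
    {"column": "age", "check": "min", "value": 0},
    {"column": "age", "check": "max", "value": 120},
    {"column": "name", "check": "not_null"},
]

def run_checks(rows, config):
    """Return a status report: one pass/fail entry per configured check."""
    report = []
    for rule in config:
        col, check = rule["column"], rule["check"]
        values = [row.get(col) for row in rows]
        if check == "not_null":
            ok = all(v is not None for v in values)
        elif check == "min":
            ok = all(v is not None and v >= rule["value"] for v in values)
        elif check == "max":
            ok = all(v is not None and v <= rule["value"] for v in values)
        else:
            raise ValueError(f"unknown check: {check}")
        report.append({"column": col, "check": check, "passed": ok})
    return report

rows = [{"name": "a", "age": 34}, {"name": "b", "age": -1}]
for entry in run_checks(rows, config):
    print(entry)
```

In the real framework the parsed config would compile to Spark queries rather than Python loops, and the report would feed notifications and the historical record of checks.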
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e., single-stage sort-merge joins), the ability to short-circuit a FILTER operation if the file is pre-sorted on the column in the filter predicate, and quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
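The core mechanism, assigning each row to a bucket by hashing the join key at write time so a later join only pairs matching buckets, can be sketched in plain Python. This mirrors the concept only; Spark and Hive use their own hash functions and file layouts:

```python
def bucket_of(key, num_buckets):
    """Deterministic bucket assignment at write time (illustrative hash)."""
    return hash(key) % num_buckets

def write_bucketed(rows, key_col, num_buckets):
    """Group rows into buckets once, paying the shuffle/sort cost up front."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key_col], num_buckets)].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, key_col):
    """Join bucket i with bucket i only -- neither side is re-shuffled."""
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        for l in lb:
            for r in rb:
                if l[key_col] == r[key_col]:
                    out.append({**l, **r})
    return out

users = write_bucketed([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], "id", 4)
orders = write_bucketed([{"id": 1, "total": 10}], "id", 4)
print(bucketed_join(users, orders, "id"))  # [{'id': 1, 'name': 'a', 'total': 10}]
```

Because both tables were bucketed on the same key with the same bucket count, matching keys are guaranteed to land in the same bucket index, which is what lets the real engines skip the shuffle stage.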
Druid provides sub-second query latency, and Flink provides SQL on streams, allowing rich transformation/enrichment of events as they happen. In this talk we will learn how Lyft uses Flink SQL and Druid together to support real-time analytics.
Meetup: https://www.meetup.com/druidio/events/252515792/
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
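Since Spark 1.6's unified memory manager, execution and storage share a single pool carved out of the JVM heap. The arithmetic below uses the documented defaults (`spark.memory.fraction=0.6`, `spark.memory.storageFraction=0.5`, ~300 MB reserved); treat it as a rough sketch of the documented behavior, not the exact internal accounting:

```python
RESERVED_MB = 300  # memory reserved for Spark's own bookkeeping

def unified_memory_pools(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split the heap into the unified execution+storage pool, Spark 1.6+ style."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction      # shared execution + storage pool
    storage = unified * storage_fraction    # region protected from eviction
    execution = unified - storage           # execution can also borrow from storage
    return unified, storage, execution

unified, storage, execution = unified_memory_pools(4096)
print(f"unified={unified:.0f} MB, storage={storage:.0f} MB, execution={execution:.0f} MB")
# unified=2278 MB, storage=1139 MB, execution=1139 MB
```

The key design point the talk explores: the boundary between the two regions is soft, so execution can evict cached blocks beyond the storage fraction, which avoids the wasted memory of the older static split.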
Join operations are often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames, down to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
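The two basic join methods referred to above are typically the broadcast hash join and the shuffle sort-merge join. A simplified sketch of how a planner might choose between them follows; the 10 MB figure matches the default of `spark.sql.autoBroadcastJoinThreshold`, but the real planner also weighs hints, join types, and statistics:

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024

def choose_join_strategy(left_bytes, right_bytes,
                         threshold=AUTO_BROADCAST_THRESHOLD):
    """Simplified physical-join choice: ship the smaller side to every
    executor if it fits under the threshold (no shuffle of the big side);
    otherwise shuffle both sides by key and sort-merge join."""
    smaller = min(left_bytes, right_bytes)
    if smaller <= threshold:
        return "broadcast hash join"
    return "sort merge join"

print(choose_join_strategy(50 * 1024**3, 2 * 1024**2))  # broadcast hash join
print(choose_join_strategy(50 * 1024**3, 8 * 1024**3))  # sort merge join
```

This is why many of the errors the talk covers (driver OOMs from oversized broadcasts, skewed shuffle partitions in sort-merge joins) trace back to which of these two paths the planner picked.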
Parallelization of Structured Streaming Jobs Using Delta Lake (Databricks)
We’ll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the issues that we faced at Tubi while running regular structured streaming. We’ll also give a quick overview of why we transitioned from Parquet data files to Delta and the problems it solved for us in running our streaming jobs.
Containerized Stream Engine to Build Modern Delta Lake (Databricks)
Everything is changing by the day: your business, your analytics platform, and your data. Deriving real-time insights from this huge volume of data is key to survival, and a robust solution lets you operate at the speed of change.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark (Bo Yang)
The slides explain how shuffle works in Spark and help people understand more details about Spark internals. They show how the major classes are implemented, including: ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), and ShuffleReader (BlockStoreShuffleReader).
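The choice among the three ShuffleWriter implementations can be summarized as a decision function. This is a simplification loosely following SortShuffleManager's selection logic; the threshold names correspond to real Spark configs, but the sketch omits details such as the ordering requirement:

```python
BYPASS_MERGE_THRESHOLD = 200         # spark.shuffle.sort.bypassMergeThreshold default
MAX_SERIALIZED_PARTITIONS = 1 << 24  # partition limit for the serialized path

def choose_shuffle_writer(num_partitions, map_side_combine,
                          serializer_relocatable):
    """Which writer SortShuffleManager would pick (simplified)."""
    # Few partitions, no map-side aggregation: write one file per partition
    # and concatenate, skipping the sort machinery entirely.
    if not map_side_combine and num_partitions <= BYPASS_MERGE_THRESHOLD:
        return "BypassMergeSortShuffleWriter"
    # Serialized (Tungsten) path: sort records in binary form, which needs a
    # serializer that allows relocating serialized bytes and no aggregation.
    if (serializer_relocatable and not map_side_combine
            and num_partitions <= MAX_SERIALIZED_PARTITIONS):
        return "UnsafeShuffleWriter"
    # General path: deserialized sort with spilling and merging.
    return "SortShuffleWriter"

print(choose_shuffle_writer(100, map_side_combine=False, serializer_relocatable=True))
# BypassMergeSortShuffleWriter
print(choose_shuffle_writer(5000, map_side_combine=True, serializer_relocatable=True))
# SortShuffleWriter
```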
Tuning Apache Spark for Large-Scale Workloads - Gaoxiang Liu and Sital Kedia (Databricks)
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17 (spark-project)
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale, California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
An Interactive Introduction To R (Programming Language For Statistics) - Dataspora
This is an interactive introduction to R.
R is an open source language for statistical computing, data analysis, and graphical visualization.
While most commonly used within academia, in fields such as computational biology and applied statistics, it is gaining currency in industry as well – both Facebook and Google use R within their firms.
Angular 2 has changed significantly compared to the first version. In this talk, Alexander covers the overall architecture of the new framework, dependency injection, component interaction, routing, the compiler, and approaches to application deployment.
After giving this talk at the NSSpain conference, Simone Civetta will explain which metrics can be used to evaluate the quality of a codebase. Since this question is always up for debate, prepare your arguments!
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019 (corehard_by)
C++ is known for performance, expressiveness, the lack of a standard build system and package management, complexity, and long compile times. The inability to iterate quickly is one of the biggest killers of productivity. This talk is aimed at anyone interested in improving the last of these points. It will provide insights into why compilation (and linking) take so long for C++, and will then provide an exhaustive list of techniques and tools to mitigate the problem, such as:
* Tooling and infrastructure - hardware, build systems, caching, distributed builds, diagnostics of bottlenecks, code hygiene
* Techniques - unity builds, precompiled headers, linking (static vs shared libraries)
* Source code modification - the PIMPL idiom, better template use, annotations
* Modules - what they are, when they are coming to C++, and what becomes obsolete because of them
Effective C++ (Meyers has nothing to do with it :). The C++ ecosystem is still rapidly evolving, which makes this language one of the most effective tools of our time. I'd like to point out three reasons why C++ is so attractive nowadays. The first is the new extensions to the language standard that enable programmers to write effective and efficient code. The second is the maturity of the development tools and the acceleration of the build process. The third is the maturity of the auxiliary tools that allow us to keep control over code quality and other aspects of the software development life cycle. This talk is an ode to the C++ language!
Using Azure DevOps in Microservice Application Development (Vitebsk Miniq)
This presentation is based on Igor Sychev's talk at Vitebsk MiniQ#17, held on July 25, 2019:
https://vk.com/miniq17;
https://communities.by/events/miniq-vitebsk-17.
About the talk:
We have been implementing CI/CD on Azure DevOps for our microservice-style application, hosted on Azure Kubernetes Service, for more than 6 months. We want to share our CI/CD successes and mistakes with developers and DevOps engineers. We will demonstrate our approaches to Build/Release, creating test environments using ARM templates, approving application deployments to production environments, and the evolution of these processes over time.
SPARKNaCl: A verified, fast cryptographic library (AdaCore)
SPARKNaCl https://github.com/rod-chapman/SPARKNaCl is a new, freely-available, verified and fast reference implementation of the NaCl cryptographic API, based on the TweetNaCl distribution. It has a fully automated, complete and sound proof of type-safety and several key correctness properties. In addition, the code is surprisingly fast - out-performing TweetNaCl's C implementation on an Ed25519 Sign operation by a factor of 3 at all optimisation levels on a 32-bit RISC-V bare-metal machine. This talk will concentrate on how "Proof Driven Optimisation" can result in code that is both correct and fast.
NovaProva, a new generation unit test framework for C programs (Greg Banks)
I wrote NovaProva because I was sick of writing unit tests using the venerable but clunky CUnit library. CUnit looked easy when I started writing unit tests, but I soon discovered its limitations and ended up screaming in frustration. Meanwhile, over 18 months the Cyrus IMAP server project has gone from 0 unit tests to 461, of which 277 are written in C with CUnit. So that's a big itch!
Performance Benchmarking: Tips, Tricks, and Lessons LearnedTim Callaghan
Presentation covering 25 years worth of lessons learned while performance benchmarking applications and databases. Presented at Percona Live London in November 2014.
C++ in kernel mode, Roman Beleshev
Have you ever written drivers for Windows? In C++? It's time to debunk the myth that driver development means only C and only hardcore. This talk covers the differences between kernel mode and user mode, the technical details of implementing certain C++ features, and why writing drivers in C++ is possible, worthwhile, and genuinely enjoyable.
JavaScript and popular programming paradigms (OOP, AOP, FP, DSL). Overview of the language to see what tools we can leverage to reduce complexity of our projects.
This part goes over more language features and looks at FP, and DSLs with JavaScript.
The presentation was delivered at ClubAJAX on 3/2/2010.
Blog post: http://lazutkin.com/blog/2010/mar/4/exciting-js-2/
Beginning is Part I: http://www.slideshare.net/elazutkin/exciting-javascript-part-i
Navigating the Metaverse: A Journey into Virtual Evolution (Donna Lenk)
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
Quarkus Hidden and Forbidden Extensions (Max Andersen)
Quarkus has a vast extension ecosystem and is known for its supersonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser-known features, extensions, and development techniques.
May Marketo Masterclass, London MUG May 22 2024 (Adele Miller)
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Top Features to Include in Your Winzo Clone App for Business Growth (rickgrimesss22)
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... (Globus)
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Large Language Models and the End of Programming (Matt Welsh)
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
OpenMetadata Community Meeting - 5th June 2024OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
An Enterprise Resource Planning (ERP) system includes various modules that reduce any business's workload. Additionally, it organizes workflows, which enhances productivity. Here is a detailed explanation of the ERP modules; going through the points will help you understand how the software is changing work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted at ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
2. Things I’d like the compiler to tell me:
• I couldn’t inline that function
…because I had only its declaration.
• I’m re-evaluating these expressions on every loop iteration
…because I couldn’t prove they stay fixed.
• I’m reloading this variable from memory many times
…because I don’t know whether unrelated code changes it.
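The reload scenario above can be reproduced in a few lines. This is my own minimal sketch, not from the slides: because the stores through `out` might alias `*n`, the compiler must reload the loop bound from memory on every iteration.

```cpp
#include <cstddef>

// The compiler cannot prove that stores through 'out' leave '*n' unchanged,
// so '*n' is re-read from memory on every loop iteration.
void scale(const float* in, float* out, const std::size_t* n, float k) {
    for (std::size_t i = 0; i < *n; ++i)
        out[i] = in[i] * k;
}

// Hoisting the load by hand lets the compiler keep the bound in a register.
void scale_hoisted(const float* in, float* out, const std::size_t* n, float k) {
    const std::size_t count = *n;  // loaded exactly once
    for (std::size_t i = 0; i < count; ++i)
        out[i] = in[i] * k;
}
```

Both versions compute the same result for non-aliasing arguments; only the generated code differs.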
3. Some raw optimization data is exposed
• Clang/gcc: -Rpass
code.cc:4:25: remark: foo inlined into bar [-Rpass=inline]
• Intel: -qopt-report=[0..5]
• Microsoft (very partial):
/Qpar-report, /Qvec-report
6. Opt-Viewer Usage
• Build with an extra clang switch: -fsave-optimization-record
• *.opt.yaml files are generated, by default in the obj folder.
• Generate htmls:
$ opt-viewer.py
--output-dir <htmls folder>
--source-dir <repo>
<yamls folder>
14. Target Developers, Not Compiler Authors
• Denoise:
• Collect only optimization failures
• Ignore system headers
• Display a single entry per type/source loc (in index)
• …
• ~1.5M lines ==> 22K lines
• Don’t mention ‘passes’
• Make the index sortable, resizable & pageable
• Abridged function names
• ...
15. Example - OpenCV
• file://wsl.localhost/Ubuntu-20.04/home/ofek/html/opencv-clean/modules/photo/index.html
23. Aliasing: Counter-measures
• LTO ?
• __restrict__
• Non standard, tells the compiler an argument doesn’t alias with anything else
• __attribute__((pure)), __attribute__((const))
• Non standard, tell the compiler a function doesn’t write/read global state.
• Potentially resolves ‘clobbered by call’ opt failures.
• Strict-Aliasing
• The only C++ standard compliant way.
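A minimal sketch of the two non-standard counter-measures above (my own example; the spellings shown are GCC/Clang extensions):

```cpp
#include <cstddef>

// __restrict__ promises the compiler that 'dst' and 'src' never alias;
// passing overlapping pointers would be undefined behavior.
void add_into(float* __restrict__ dst, const float* __restrict__ src,
              std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];  // src[i] need not be reloaded after the store
}

// __attribute__((pure)): may read but never writes global state, so the
// compiler can hoist or merge calls -- potentially resolving
// 'clobbered by call' optimization failures.
__attribute__((pure)) std::size_t c_strlen(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}
```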
24. Strict Aliasing
• “Strict aliasing is an assumption made by the compiler, that objects of different types will never refer to the same memory location (i.e. alias each other.)”
Mike Acton https://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html
• Allowed optimization; on by default for -O2 and above.
• char* is an exception – always allowed to alias with any type.
• Can be disabled with -fno-strict-aliasing (a pessimization!)
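The rule in practice, in a sketch of my own: reading a float's bits through a `uint32_t*` violates strict aliasing, while copying the bytes with `std::memcpy` (which operates on the object representation, covered by the char exception) is standard-compliant, and optimizers lower it to a single register move anyway.

```cpp
#include <cstdint>
#include <cstring>

std::uint32_t bits_of(float f) {
    // UB under strict aliasing -- float and uint32_t may not alias:
    //   return *reinterpret_cast<const std::uint32_t*>(&f);

    // Compliant: memcpy copies the object representation byte-wise
    // (C++20 std::bit_cast does the same job).
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "assumes a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

Under IEEE 754, `bits_of(1.0f)` is 0x3F800000.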
25. Strict Aliasing: Forcing type difference
• typedef / using ?
• Inherit from int?
• “I considered …to allow derivation from built-in classes … However, I restrained myself. … the C conversion rules are so chaotic that pretending that int, short, etc., are well-behaved ordinary classes is not going to work. They are either C compatible, or they obey the relatively well-behaved C++ rules for classes, but not both.”
Bjarne Stroustrup, The Design and Evolution of C++, §15.11.3.
• Strong-Typedefs
• Wrappers
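A minimal strong-typedef/wrapper sketch (my own example, not from the slides): `Meters` and `Seconds` are distinct class types even though both hold a `double`, so under strict aliasing the compiler may assume a `Meters*` and a `Seconds*` never refer to the same object.

```cpp
// Tag types make each instantiation of the wrapper a distinct type.
template <class Tag>
struct Strong {
    double value;
};

using Meters  = Strong<struct MetersTag>;
using Seconds = Strong<struct SecondsTag>;

double speed(const Meters* d, const Seconds* t) {
    // d and t cannot legally alias, so t->value need not be
    // reloaded after stores through d.
    return d->value / t->value;
}
```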
31. GCC work
• https://gcc.gnu.org/legacy-ml/gcc-patches/2018-05/msg01675.html
• https://github.com/davidmalcolm/gcc-opt-viewer
• YAML -> JSON.gz
• Actually a better choice
https://stackoverflow.com/questions/27743711/can-i-speedup-yaml
• Active only during 2018
• Currently seems unusable
https://github.com/davidmalcolm/gcc-opt-viewer/issues/3
A very little-known LLVM feature that I really think can add value to your day-to-day work, certainly if you’re building with clang.
I’ll first present the existing LLVM opt-viewer tool, then optview2, which is derived work by me.
About half the time will be dedicated to demonstrations of the tool in use, with a specific focus on aliasing issues.
I want to gain visibility into compiler optimizations. More specifically, I want to see where the compiler tried optimizations and failed – there might be something I could do about it.
LLVM was already instrumented to show optimization choices (for -Rpass); Hal Finkel wrapped the data in YAML format for machine consumption.
Adam Nemet wrote the opt-viewer Python scripts: a few small scripts that process the compiler-emitted remarks into annotated source HTMLs, plus an index page, which is an optimization dashboard of sorts.
Don’t know, guessing
I stopped it at 60 GB of memory
Ran it on separate subsets of a real project
An extra step is needed to make this usable to us
Let’s take that extra step.
The demos show results from the optview2 version, but let it be said that all the heavy lifting is done by the original code
Credit to Ilan Ben Hagai for javascript wizardry
Full opt-viewer, not just failures
file:///C:/optview2/html/opencv4/modules_imgproc_src_clahe.cpp.html#L201
One of the darker corners of a language already full of dark corners
Some guesswork in interpreting the optview2 reports
Restrict: compiles on non-arguments too, but I didn’t see any impact
Turning on strict-aliasing means allowing the compiler to assume exactly that.
Turning it off is a pessimization
What can be done? Force type difference
This is not good C++ code
Don’t do this for fun. Do this in code where you *really* care about cycle-level performance.
Hopefully compilers will get good enough that we won’t have to do this.
Probably
If your project *can* be built by clang, I suggest you give it a go.
Report LLVM bugs:
Alias-analysis bugs have had no observable symptoms until now: no diagnostics are emitted for them, and they don’t result in bad codegen.
So I suspect there are plenty of them.
Tests were done on WSL2; syscalls have overhead there, but in both repos alike