This document provides an overview of Apache Spark, including what it is, common use cases, and architectural details. Spark is an engine for distributed data processing that allows processing of large datasets across clusters of machines using an in-memory approach. The document discusses Spark APIs, common use cases like ETL, machine learning, and graph processing. It also covers Spark architectural concepts like RDDs, transformations, actions, and optimizations like partitioning, joins, and caching.
Dataframes in Spark - Data Analysts' perspective (Marcin Szymaniuk)
Are you a data analyst who works with Spark and often gets confused by failures you don’t understand? Have you seen a bunch of presentations or blog posts about Spark performance but you are still not certain how to apply the hints you have been given in practice?
Spark is commonly used by people who are not experts in programming but who know SQL and sometimes basic Python. They treat Spark as a tool for getting business value from the data. And that is how it should be! Yet it is common that the queries they run fail for no obvious reason. This talk is designed for such Spark users and focuses on common problems with Spark (especially DataFrames and SQL) which can be solved by anyone familiar with SQL. You don't need to read bytecode to understand the techniques presented and apply them in practice!
This talk will be a case study of multiple DataFrame queries in Spark which initially do not work. I will not only explain how to fix them, but we will go through the solution step-by-step so you will learn what to pay attention to and how to apply similar techniques to your codebase!
Managing large volumes of data isn't trivial and needs a plan. Fast Data is how we describe the nature of data in a heavily consumer-driven world. Fast in. Fast out. Is your data infrastructure ready? You will learn some important reference architectures for large-scale data problems. Three main areas are covered:
Organize - Manage the incoming data stream and ensure it is processed correctly and on time. No data left behind.
Process - Analyze volumes of data you receive in near real-time or in a batch. Be ready for fast serving in your application.
Store - Reliably store data in the data models to support your application. Never accept downtime or slow response times.
Beyond php - it's not (just) about the code (Wim Godden)
Most PHP developers focus on writing code. But creating Web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Cloud Native Night, December 2020, talk by Jörg Viechtbauer (Senior Software Architect, QAware)
Abstract:
Neural networks like BERT have revolutionized the processing of natural language and achieve state-of-the-art performance in many NLP tasks. One of them is semantic search where documents are found by query intent and not only by exact match.
This talk takes us through the history of information retrieval and shows how keyword search has evolved into the term vector model. The desire for better search led to the development of the first semantic models like LSI or PLSA. We will see how this culminates today in the use of sophisticated deep neural networks that perform nonlinear dimensionality reduction and capture long-range dependencies.
Semantic search has never been as good and easy to implement as it is today.
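As a toy illustration of the term vector model the talk starts from, documents can be ranked by cosine similarity between sparse term-count vectors; semantic search replaces these sparse vectors with dense neural embeddings, but the ranking step stays the same. This is a minimal sketch, not any particular search engine's implementation:

```python
import math
from collections import Counter

def term_vector(text):
    """Sparse bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = ["apache spark cluster computing",
        "semantic search with neural networks",
        "spark sql and dataframes"]
query = term_vector("spark dataframes")
ranked = sorted(docs, key=lambda d: cosine(query, term_vector(d)), reverse=True)
print(ranked[0])  # the document sharing the most query terms ranks first
```

A keyword match like this fails on synonyms ("dataframe" vs "table"); embedding models such as BERT close exactly that gap by placing semantically similar texts near each other in vector space.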
About Jörg:
Jörg is a search expert at QAware and uses neural networks for semantic search and text comprehension. He has spent almost 20 years developing search engines based on both proprietary and open source software for enterprise search, eDiscovery and local search - always hunting for the perfect ranking formula.
Guaranteeing Consensus in Distributed Systems with CRDTs (Sun-Li Beatteay)
Consensus in distributed systems has been a debated topic ever since programmers discovered they could run the same program on multiple machines. Researchers have been studying consensus for decades, resulting in numerous algorithms and white papers. Unfortunately, many of these algorithms are flawed and unreliable.
However, in 2011, a team of researchers published a paper on a novel approach to distributed consensus using Conflict-free Replicated Data Types (https://hal.inria.fr/inria-00609399v1...). This paper created quite a buzz as it showed that CRDTs were mathematically proven to guarantee consensus through "Strong Eventual Consistency." They also claimed to have solved the CAP conundrum.
This presentation dives into this seminal paper in order to answer the hard questions. What are CRDTs? How do they work? And most importantly, does it actually solve CAP? By the end of this talk, everyone in the audience will have a foundational understanding of CRDTs and how they can be applied to their own work.
Best of all, I will be explaining all of this in as simple language as possible. No advanced math degree required! Sound too good to be true? You'll just have to come see for yourself!
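The merge idea behind CRDTs can be sketched with a grow-only counter, one of the simplest data types from the paper (the class and names here are illustrative): each replica increments only its own slot, and merge takes the element-wise maximum, so replicas converge to the same value no matter how often or in what order states are exchanged:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.id] += 1

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        # Commutative, associative, idempotent -> strong eventual consistency.
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # two increments on replica a
b.increment()                  # one concurrent increment on replica b
a.merge(b); b.merge(a)         # exchange state in either order
print(a.value(), b.value())    # both converge to 3
```

Because merge is idempotent and order-independent, no coordination round is needed: that is the "availability without giving up convergence" trade the paper makes against classic consensus protocols.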
How to find and fix your Oracle application performance problem (Cary Millsap)
How long does your code take to run? Is it changing? When it is slow, WHY is it slow? Is it your fault, or somebody else's? Can you prove it? How much faster could your code be? Do you know how to measure the performance of your code as user workloads and data volumes increase? These are fundamental questions about performance, but the vast majority of Oracle application developers can't answer them. The most popular performance tools available to them—and to the database administrators that run their code in production—are incapable of answering any of these questions. But the Oracle Database can give you exactly what you need to answer these questions and many more. You can know exactly where YOUR CODE is spending YOUR TIME. This session explains how.
Cassandra Data Modeling - Practical Considerations @ Netflix (nkorla1share)
The Cassandra community has consistently requested that we cover C* schema design concepts. This presentation goes in depth on the following topics:
- Schema design
- Best Practices
- Capacity Planning
- Real World Examples
This document provides summaries of services announced at AWS re:Invent 2018 across multiple categories including compute, storage, database, analytics, machine learning, Internet of Things, developer tools, and security. It outlines new and updated services such as AWS RoboMaker, AWS Amplify Console, AWS Transfer for SFTP, AWS DataSync, and Amazon S3 Batch Operations. It also summarizes pricing and availability for many of the announced services.
Apache Spark - Data intensive processing in practice (Marcin Szymaniuk)
The document discusses Spark, an open-source distributed processing engine that can run on Hadoop clusters. It provides an overview of Spark's architecture and capabilities for distributed data processing, streaming, machine learning and SQL queries. Examples of use cases like ETL, analytics and joins on large datasets are presented to illustrate how Spark can be used to process massive amounts of data across clusters.
This document discusses using DataFrames in Spark from a data analyst's perspective. It provides an overview of Spark for data analysts, the Spark execution model, and some case studies. The key points are:
- DataFrames in Spark allow analysts to analyze all data faster without extra data copies by bringing analysis to the data.
- Transformations like joins, groups, and aggregations can be narrow transformations that operate on each partition separately or wide transformations that require data shuffling between partitions.
- Organizing data by date partitions (year/month/day) and repartitioning before partitioning can improve query performance for date-based queries and avoid creating too many small files.
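The narrow/wide distinction in the points above can be sketched in plain Python (this is an illustration of the data movement, not Spark's internals): a narrow transformation like filter runs inside each partition independently, while a wide transformation like a grouped aggregation must first route every row to the partition that owns its key, which is the shuffle:

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]  # two input partitions

# Narrow: map/filter runs per partition, no data movement between partitions.
filtered = [[kv for kv in p if kv[1] > 1] for p in partitions]

# Wide: a shuffle routes each key to one target partition before aggregating.
def shuffle_and_sum(parts, n_out):
    out = [defaultdict(int) for _ in range(n_out)]
    for p in parts:
        for key, val in p:
            # In real Spark this step is a network transfer between executors.
            out[hash(key) % n_out][key] += val
    return out

grouped = shuffle_and_sum(partitions, 2)
totals = {k: v for p in grouped for k, v in p.items()}
print(totals)  # key "a" is summed across both input partitions
```

The cost model follows directly: narrow chains are cheap and pipelined, while every wide transformation pays for serializing and moving rows, which is why reducing shuffles is the usual first optimization.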
Real Time analytics with Druid, Apache Spark and Kafka (Daria Litvinov)
This document summarizes Daria Litvinov's presentation on using Druid, Apache Spark, and Kafka for real-time analytics. The presentation covers setting up real-time dashboards using these technologies, addressing issues like data loss on job restarts, and the solution of committing Kafka offsets manually and storing them synchronously.
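The fix described here, committing Kafka offsets only after a batch is safely processed, can be sketched without Kafka (the store and names below are illustrative stand-ins for a durable offset store and a topic partition): on restart the job resumes from the last persisted offset, so a crash re-processes at most the in-flight batch instead of silently losing it:

```python
offset_store = {"topic-0": 0}             # stands in for a durable store
log = [f"event-{i}" for i in range(10)]   # stands in for a Kafka partition
processed = []

def run_batch(batch_size):
    start = offset_store["topic-0"]
    batch = log[start:start + batch_size]
    processed.extend(batch)                        # 1. process the micro-batch
    offset_store["topic-0"] = start + len(batch)   # 2. only then commit offset

run_batch(4)
run_batch(4)   # a restart between calls re-reads from offset 4, not 0
print(offset_store["topic-0"])  # 8
```

Committing before processing inverts steps 1 and 2 and turns a crash into data loss; committing after gives at-least-once delivery, which downstream idempotent writes can tighten further.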
What’s New in the Upcoming Apache Spark 3.0 (Databricks)
Learn about the latest developments in the open-source community with Apache Spark 3.0 and DBR 7.0. The upcoming Apache Spark™ 3.0 release brings new capabilities and features to the Spark ecosystem. In this online tech talk from Databricks, we will walk through updates in the Apache Spark 3.0.0-preview2 release as part of our new Databricks Runtime 7.0 Beta, which is now available.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise (Patrick McFadin)
Wait! Back away from the Cassandra 2ndary index. It’s ok for some use cases, but it’s not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can’t model that in C*, even after watching all of Patrick McFadin's videos. What do I do?” The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration you can search and analyze data stored in your Cassandra database to your heart’s content. Take our hand. WE will show you how.
This document provides an overview of Apache Samza, an open source stream processing framework. It discusses why stream processing is useful, Samza's design of processing streams of data across jobs and tasks, how its design is implemented using Apache Kafka for messaging and YARN for resource management, and how to use Samza by developing stream and stateful tasks.
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent... (DataStax Academy)
Wait! Back away from the Cassandra 2ndary index. It’s ok for some use cases, but it’s not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can’t model that in C*, even after watching all of Patrick McFadin's videos. What do I do?” The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration you can search and analyze data stored in your Cassandra database to your heart’s content. Take our hand. WE will show you how.
Managing your black friday logs - Code Europe (David Pilato)
The document discusses optimally configuring Elasticsearch clusters for ingesting time-based data like logs. It recommends using time-based indices with a new index created each day. It also discusses techniques for scaling clusters by adding more shards as data volumes increase and distributing the data across nodes to avoid bottlenecks. The optimal bulk size for indexing may vary depending on factors like document size and should be tested.
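The two recommendations above, one index per day and a tuned bulk size, can be sketched as follows (the index prefix and chunk size are illustrative; the right bulk size has to be measured for your document sizes, as the summary notes):

```python
from datetime import date

def daily_index(event_date, prefix="logs"):
    """Route each event to a time-based index, one per day."""
    return f"{prefix}-{event_date:%Y.%m.%d}"

def bulk_chunks(docs, chunk_size=3):
    """Split documents into fixed-size bulk requests; tune chunk_size by testing."""
    for i in range(0, len(docs), chunk_size):
        yield docs[i:i + chunk_size]

day = date(2017, 11, 24)
print(daily_index(day))                       # logs-2017.11.24
docs = [{"msg": f"line {i}"} for i in range(7)]
sizes = [len(c) for c in bulk_chunks(docs)]
print(sizes)                                  # [3, 3, 1]
```

Daily indices make retention trivial (drop yesterday's index instead of deleting documents) and keep shard counts predictable as volume grows.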
Managing your Black Friday Logs - NDC Oslo (David Pilato)
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical of Black Friday.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
Faceted Search – the 120 Million Documents Story (Sourcesense)
Upayavira's presentation at Online Information 2010 in London: the case study of an Enterprise-critical migration from custom Lucene indexes to Apache Solr, with a significant focus on scalability.
The solution needed to provide search against rapidly changing data-sets and multi-million document indexes, enabling complex queries with sub-second responses and maintaining high availability.
The document discusses Spark Streaming and its execution model. It describes how Spark Streaming receives data streams, divides them into micro-batches, and processes the micro-batches using Spark. It also covers transformations and actions that can be performed on DStreams, window operations, and deployment options for Spark Streaming including local, standalone, and cluster modes.
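The micro-batch model described above can be sketched in plain Python (an illustration of the execution model, not Spark's code): the stream is cut into small batches, and a window operation simply combines the last few micro-batches:

```python
def micro_batches(stream, batch_size):
    """Cut a stream into small batches, as Spark Streaming does by time interval."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

def windowed_counts(batches, window_len):
    """A window operation: aggregate over the last `window_len` micro-batches."""
    counts = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_len + 1):i + 1]
        counts.append(sum(len(b) for b in window))
    return counts

batches = list(micro_batches(list(range(10)), 3))  # [[0,1,2],[3,4,5],[6,7,8],[9]]
print(windowed_counts(batches, 2))                 # [3, 6, 6, 4]
```

Because each micro-batch is an ordinary Spark job, every DStream transformation and deployment mode of batch Spark carries over to streaming unchanged; windows just widen the input of one such job.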
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
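The vectorization point above can be sketched without Spark (illustrative only; real Pandas UDFs in Spark 2.3 receive pandas.Series backed by Arrow batches, not Python lists): instead of calling a Python function once per row, the engine hands the UDF a whole batch of values and gets a batch back, amortizing the per-call overhead:

```python
def scalar_udf(x):
    """Called once per row: crosses the Python call boundary len(rows) times."""
    return x * 2

def vectorized_udf(batch):
    """Called once per batch: one boundary crossing per column chunk."""
    return [x * 2 for x in batch]

rows = list(range(6))
per_row = [scalar_udf(x) for x in rows]      # 6 calls
per_batch = []
for chunk in (rows[:3], rows[3:]):           # 2 calls on batch-sized chunks
    per_batch.extend(vectorized_udf(chunk))
print(per_row == per_batch)                  # True: same result, fewer crossings
```

In real PySpark the saving is larger still, because Arrow transfers the batch to Python without per-value serialization.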
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Talk at Cassandra conf ... (it-people)
Modern Apache Cassandra provides a highly scalable and available database. Some key points covered in the document include:
- Cassandra has been under active development since 2008 and is now at version 2.0, with 2.1 upcoming.
- It is used by many companies for applications such as social media features, logging, notifications, and more due to its abilities around scalability, high availability, and tunable consistency.
- Cassandra uses a decentralized architecture with no single point of failure, dynamically partitioning data across nodes using a token ring approach for high availability.
- It provides tunable consistency levels, lightweight transactions, and other features for flexibility while maintaining high availability.
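The token-ring placement mentioned above can be sketched in a few lines (illustrative; real Cassandra uses the Murmur3 partitioner and virtual nodes, and the token space here is shrunk to one byte): each node owns a range of the token space, a key's token picks the primary node, and replicas are the next nodes clockwise on the ring:

```python
import bisect
import hashlib

ring_tokens = [0, 64, 128, 192]   # each node owns a range of the token space
nodes = ["node-A", "node-B", "node-C", "node-D"]

def token(key):
    """Deterministic token in [0, 256); real Cassandra uses Murmur3."""
    return hashlib.md5(key.encode()).digest()[0]

def replicas(key, rf=3):
    """Primary = owner of the key's token range; replicas follow clockwise."""
    i = bisect.bisect_right(ring_tokens, token(key)) - 1
    return [nodes[(i + k) % len(nodes)] for k in range(rf)]

owners = replicas("user:42")
print(owners)   # three distinct nodes, so no single point of failure
```

Tunable consistency then falls out of this layout: a write at consistency level QUORUM succeeds once two of these three replicas acknowledge it, trading latency against durability per request.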
Cassandra : to be or not to be @ TechTalk (Andriy Rymar)
This presentation covers the Apache Cassandra cluster, its data model, and read, write, update, and delete operations at both the node and cluster level. It explains why deletes carry overhead, and why engineers need to understand business requirements in order to build the right architecture and select the best tools.
Apache Incubator Samza: Stream Processing at LinkedIn (Chris Riccomini)
This is the slide deck that was presented at the Hadoop Users Group at LinkedIn on November 5, 2013.
The presentation covers what Samza is, why we built it, and how it works.
This document provides many tips for improving a kanban board, including using different colors to identify different types of work, tracking metrics like lead time and throughput, setting work in progress limits, and ensuring a continuous flow of work through the use of techniques like pull systems and queues. It emphasizes making the board visually clear and focusing on workflow rather than individuals.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is a widely used ETL tool for processing, indexing and ingesting data into the serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data, extract vector representations, and push the vectors to the Milvus vector database for search serving.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party. I will share these foundational concepts to build on:
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
HCL Notes and Domino license cost reduction in the world of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove redundant or unused accounts in order to save money. There are also some practices that can lead to unnecessary expenses, for example using a person document instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered
- Reducing license costs by finding and fixing misconfigurations and redundant accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices you can apply immediately
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
2. www.tantusdata.com
About me
• Data Engineer @TantusData
• Have worked for: Spotify, Apple, telcos, startups
• Cluster installations, application architecture and
development, training, data team support
marcin@tantusdata.com
marcin.szymaniuk@gmail.com
@mszymani
16.
Network improvement
• Score historical customer network quality
• Define a model predicting churn based on historical
score
• Simulate base station upgrade and calculate expected
score after the upgrade
• Use the simulated score with churn prediction model
17.
Bring analysis to data
[Diagram: only a sample of the DATA is moved to R / Python / SAS]
• Sample only (region, latest month…)
• Coarse aggregate, e.g. month vs. hour (1:720)
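The reduction step above can be sketched in PySpark. This is illustrative only: the table and column names (`network_quality`, `ts`, `score`, `region`, `customer_id`) are assumptions, not from the slides, and it requires a running Spark session.

```python
# Sketch: shrink the data before exporting it to a local R / Python / SAS workflow.
# All table and column names below are hypothetical.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("sample-before-export").getOrCreate()
hourly = spark.table("network_quality")  # one row per customer per hour

# Option 1: sample only - restrict to one region and the latest month
latest = hourly.filter((F.col("region") == "north") & (F.col("ts") >= "2018-05-01"))

# Option 2: coarse aggregate - collapse hourly rows ~720:1 into monthly averages
monthly = (hourly
           .groupBy("customer_id", F.date_trunc("month", F.col("ts")).alias("month"))
           .agg(F.avg("score").alias("avg_score")))

monthly.toPandas().to_csv("monthly_scores.csv")  # now small enough for a laptop
```

Either option keeps the export to the analyst's machine small, while the heavy filtering and aggregation stay on the cluster.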
19.
Bring analysis to data
[Diagram: the analysis runs where the DATA lives]
• Analyze all data
• Faster analysis
• No extra data copies (GDPR!)
• Many solutions are already implemented (MLlib,
GraphX…)
60.
What to do?
• Understand your data!
• Control the level of parallelism
.config("spark.sql.shuffle.partitions", "2000")
rdd.join(anotherRDD, 2000)
.repartition(2000)
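The three knobs listed above can be sketched together in PySpark. The partition count 2000 is just the slide's example value, and `rdd`/`anotherRDD` are tiny stand-ins for real datasets; a Spark installation is required to run this.

```python
# Sketch of three ways to control the level of parallelism; requires Spark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallelism-demo")
         .config("spark.sql.shuffle.partitions", "2000")  # DataFrame/SQL shuffles
         .getOrCreate())

# RDD API: pass the partition count directly to the shuffle-producing operation
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])
anotherRDD = spark.sparkContext.parallelize([(1, "x"), (2, "y")])
joined = rdd.join(anotherRDD, numPartitions=2000)

# Or reshape an existing DataFrame/RDD explicitly
df = spark.range(10_000).repartition(2000)
```

The right number depends on data volume: too few partitions leave cores idle or cause memory pressure, too many drown the job in scheduling overhead.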
80.
Cache
• Transformations are lazy!
• Re-using RDD/DF means re-calculation!
• Branch in execution plan is a candidate for caching
• You cannot control priority - it's LRU
• Know the size of your RDDs/DF - check Spark UI.
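A minimal caching sketch illustrating the bullet points above, assuming a hypothetical `events` table with `valid`, `day` and `user` columns, and a running Spark session:

```python
# Sketch: cache a DataFrame that two branches of the execution plan re-use.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
base = spark.table("events").filter(F.col("valid"))  # expensive to recompute

base.cache()  # transformations are lazy - nothing happens until the first action

daily   = base.groupBy("day").count()    # first action materialises the cache
by_user = base.groupBy("user").count()   # re-uses the cached data instead of
daily.show()                             # recomputing the filter from scratch
by_user.show()

# Check the actual cached size on the Storage tab of the Spark UI;
# eviction is LRU and cannot be prioritised per DataFrame.
```

Without the `cache()` call, each action would re-read and re-filter the source data, since re-using an RDD/DataFrame otherwise means re-calculation.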