Leverage the Splunk architecture to provide the best possible performance. Whether you deploy on-premises, in the cloud, or on Splunk Cloud, this session will guide you through scenarios that will help you get the best from all these options. The agenda also covers how you can plan your searches and reporting to provide the best results for your end users.
SplunkLive Sydney Scaling and best practice for Splunk on premise and in the ... – Gabrielle Knowles
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While... – Databricks
In this session, IBM will present details on advanced Apache Spark analytics currently being performed through a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM. The Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events.
Come and learn how astronomers and researchers are using Apache Spark, in conjunction with assets such as IBM’s Cognitive Compute Cluster with over 700 GPUs, to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. The speakers will also share details on one of the key components of this implementation: Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, specifically designed to optimize their performance with object stores. Learn how Stocator works, and see how it was able to greatly improve performance and reduce the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira – Spark Summit
Toon is a leading brand in the European smart energy market, currently expanding internationally, providing energy usage insights, eco-friendly energy management, and smart thermostat use for the connected home. As value-added services become ever more relevant in this market, we need to ensure that we can easily and safely on-board new tenants into our data platform. In this talk we're going to guide you through a less-discussed side of using Spark in production: devops. We will speak about our journey from an on-premises cluster to a managed solution in the cloud. A lot of moving parts were involved: ETL flows, data sharing with 3rd parties, and data migration to the new environment. Add to this the need to have a multi-tenant environment, revamp our toolset, and deploy a live public-facing service. It's easy to find great examples of how Spark is used for data-science purposes. On the data engineering side, we need to deploy production services; ensure data is cleaned, secured, and available; and keep the data-science teams happy. We'd like to share some of the choices we made and some of the lessons learned from this (ongoing) transition.
Join Apache Solr committer and Lucidworks engineer Tim Potter for a webinar to learn how to unlock and understand your big data - and get the most out of your Hadoop investment.
This talk was given by Sean Owen at the 10th meeting, on February 3rd, 2014.
Having collected Big Data, organizations are now keen on data science and “Big Learning”. Much of the focus has been on data science as exploratory analytics: offline, in the lab. However, building from that a production-ready large-scale operational analytics system remains a difficult and ad-hoc endeavor, especially when real-time answers are required. Design patterns for effective implementations are emerging, which take advantage of relaxed assumptions, adopt a new tiered "lambda" architecture, and pick the right scale-friendly algorithms to succeed. Drawing on experience from customer problems and the open source Oryx project at Cloudera, this session will provide examples of operational analytics projects in the field, and present a reference architecture and algorithm design choices for a successful implementation.
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St... – Spark Summit
Spark data processing is shifting from on-premises to cloud services to take advantage of their horizontal resource scalability, better data accessibility, and easier manageability. However, fully utilizing the computational power, fast storage, and networking offered by a cloud service can be challenging without a deep understanding of workload characteristics and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework, Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process for configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM and Spark configurations to fully utilize cloud hardware resources and Spark's computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue with the Java instanceof operator; the fix, made in Scala, greatly improves the performance of GATK4 and other Spark-based workloads.
SSR: Structured Streaming for R and Machine Learning – felixcss
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of continuous applications, this session will explore the ever more popular Structured Streaming API in Apache Spark, its application to R, and examples of machine learning use cases built with it.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Session hashtag: #SFdev2
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J... – Spark Summit
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Taking Splunk to the Next Level – Architecture – Splunk
Are you outgrowing your initial Splunk deployment? Is Splunk becoming mission critical and you need to make sure it's Enterprise ready? Attend this session led by Splunk experts to learn about taking your Splunk deployment to the next level. Learn about Splunk high availability architectures with Splunk Search Head Clustering and Index Replication. Additionally, learn how to manage your deployment with Splunk’s operational and management controls to manage Splunk capacity and end user experience.
Getting Started with Splunk Enterprise
What is Splunk? At the end of this session you’ll have a high-level understanding of the pieces that make up the Splunk Platform, how it works, and how it fits in the landscape of Big Data. You’ll see practical examples that differentiate Splunk while demonstrating how to gain quick time to value.
Splunk is probably best known, along with other Security Information and Event Management (SIEM) software, for its use in intrusion detection, W/LAN traffic monitoring, and more. But unlike other software systems, which rely on modules and add-ons, Splunk offers a robust real-time big data collection and reporting framework, complete with its own Search Processing Language (SPL) and ready-to-use point-and-click reporting tools.
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St... – t_ivanov
Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of the processing engine and file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. The exceptions are queries involving text processing, which do not benefit from compression.
Annual gathering presenting what's new from .conf23 – Rafael Santos
Product launches: Be the first to hear about the latest updates and new features of the Splunk platform. Discover how the new capabilities can enhance your data analysis and optimize your operations.
Success stories: Hear accounts from organizations that achieved extraordinary results by implementing Splunk in their operations. Learn best practices and lessons learned directly from those already reaping the benefits of this powerful solution.
Panel discussions: Join lively debates and discussion panels on relevant Splunk-related topics. Get the chance to ask the experts questions and gain valuable insights.
Networking: Expand your network of contacts and interact with other professionals who share the same interest in Splunk. Meet colleagues in the field, exchange experiences, and make meaningful connections.
SplunkLive Sydney Enterprise Security & User Behavior Analytics – Gabrielle Knowles
Splunk is a powerful platform for understanding your data. The preview of the Machine Learning Toolkit and Showcase App extends Splunk with a rich suite of advanced analytics and machine learning algorithms, which are exposed via an API and demonstrated in a showcase. In this session, we'll present an overview of the app architecture and API and then show you how to use Splunk to easily perform a wide variety of tasks, including outlier detection, predictive analytics, event clustering, and anomaly detection. We’ll use real data to explore these techniques and explain the intuition behind the analytics.
Join the Developer workshop to learn about the many options developers have to extend and integrate with the Splunk platform: using our various language SDKs and the Web Framework, creating custom components such as Search Commands and Modular Inputs, and ultimately understanding the potential opportunity in creating your own Splunk Apps.
SplunkLive Melbourne Enterprise Security & User Behavior Analytics – Gabrielle Knowles
This session will review Splunk’s two premium solutions for information security organizations: Splunk for Enterprise Security (ES) and Splunk User Behavior Analytics (UBA). Splunk ES is Splunk's award-winning security intelligence solution that brings immediate value for continuous monitoring across SOC and incident response environments – allowing you to quickly detect and respond to external and internal attacks, simplifying threat management while decreasing risk. Splunk UBA is a new technology that applies unsupervised machine learning and data science to solving one of the biggest problems in information security today: insider threat. You’ll learn how Splunk UBA works in tandem with ES, or third-party data sources, to bring significant automated analytical power to your SOC and Incident Response teams. We’ll discuss each solution and see them integrated and in action through detailed demos.
SplunkLive Perth Enterprise Security & User Behavior Analytics – Gabrielle Knowles
This session will review Splunk’s two premium solutions for information security organizations: Splunk for Enterprise Security (ES) and Splunk User Behavior Analytics (UBA). Splunk ES is Splunk's award-winning security intelligence solution that brings immediate value for continuous monitoring across SOC and incident response environments – allowing you to quickly detect and respond to external and internal attacks, simplifying threat management while decreasing risk. Splunk UBA is a new technology that applies unsupervised machine learning and data science to solving one of the biggest problems in information security today: insider threat. You’ll learn how Splunk UBA works in tandem with ES, or third-party data sources, to bring significant automated analytical power to your SOC and Incident Response teams. We’ll discuss each solution and see them integrated and in action through detailed demos.
Splunk is a powerful platform for understanding your data. The preview of the Machine Learning Toolkit and Showcase App extends Splunk with a rich suite of advanced analytics and machine learning algorithms, which are exposed via an API and demonstrated in a showcase. In this session, we'll present an overview of the app architecture and API and then show you how to use Splunk to easily perform a wide variety of tasks, including outlier detection, predictive analytics, event clustering, and anomaly detection. We’ll use real data to explore these techniques and explain the intuition behind the analytics.
SplunkLive Brisbane Splunk for Operational Security Intelligence – Gabrielle Knowles
You have spent a ton of money on your security infrastructure, but how do you string all those pieces together to achieve your goals of reducing time to respond and detecting and preventing threats, and, most importantly, having your security team serve your business and mission? Learn how to organize your security resources to get the most benefit, and see a live demonstration of operationalizing those resources so your security teams can do more for your organization.
The ongoing cyber-war has a front line and that is the endpoint. In this session, you'll learn various methods to improve endpoint security with the Splunk Universal Forwarder and with commercial endpoint solutions. You can gain critical, timely, detailed information about what's happening on your desktops, laptops, hosts, and POS systems. You can correlate this data to network, threat intel, and other data sources. You'll learn how filesystem details, processes, services, hashes, ports, registry settings and more can be used to detect attackers. This will help any organization using Splunk to greatly improve their security posture.
SplunkLive Brisbane Getting Started with IT Service Intelligence – Gabrielle Knowles
Are you currently using Splunk to troubleshoot and monitor your IT environment? Do you want more out of Splunk but don’t know how? Here’s your chance to learn more about Splunk IT Service Intelligence (Splunk ITSI) and get hands-on with it for the very first time.
Join the Developer workshop to learn about the many options developers have to extend and integrate with the Splunk platform: using our various language SDKs and the Web Framework, creating custom components such as Search Commands and Modular Inputs, and ultimately understanding the potential opportunity in creating your own Splunk Apps.
SplunkLive Canberra Enterprise Security & User Behavior Analytics – Gabrielle Knowles
This session will review Splunk’s two premium solutions for information security organizations: Splunk for Enterprise Security (ES) and Splunk User Behavior Analytics (UBA). Splunk ES is Splunk's award-winning security intelligence solution that brings immediate value for continuous monitoring across SOC and incident response environments – allowing you to quickly detect and respond to external and internal attacks, simplifying threat management while decreasing risk. Splunk UBA is a new technology that applies unsupervised machine learning and data science to solving one of the biggest problems in information security today: insider threat. You’ll learn how Splunk UBA works in tandem with ES, or third-party data sources, to bring significant automated analytical power to your SOC and Incident Response teams. We’ll discuss each solution and see them integrated and in action through detailed demos.
Splunk is a powerful platform for understanding your data. The preview of the Machine Learning Toolkit and Showcase App extends Splunk with a rich suite of advanced analytics and machine learning algorithms, which are exposed via an API and demonstrated in a showcase. In this session, we'll present an overview of the app architecture and API and then show you how to use Splunk to easily perform a wide variety of tasks, including outlier detection, predictive analytics, event clustering, and anomaly detection. We’ll use real data to explore these techniques and explain the intuition behind the analytics.
SplunkLive Canberra Getting Started with IT Service Intelligence – Gabrielle Knowles
Are you currently using Splunk to troubleshoot and monitor your IT environment? Do you want more out of Splunk but don’t know how? Here’s your chance to learn more about Splunk IT Service Intelligence (Splunk ITSI) and get hands-on with it for the very first time.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake – Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today's world, where data privacy and compliance are top priorities for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round-table discussion of vector databases, unstructured data, AI, big data, real-time, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
Learn SQL from basic queries to advanced queries – manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... – sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Analysis insight about a Flyball dog competition team's performance – roli9797
Insight from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Adjusting primitives for graph : SHORT REPORT / NOTES – Subhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list based graph representation used by graph algorithms such as PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Adjusting OpenMP PageRank : SHORT REPORT / NOTES – Subhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf – GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
14. Forwarding Tier
[Architecture diagram: Linux and Windows forwarders, a deployment server, a Windows SharePoint server, an ePO database, a Checkpoint server, and a Windows AD server feed Syslog-NG servers (syslog 514/tcp & 514/udp) and a Linux heavy forwarder running TA-McAfee (DBConnect) and TA-Checkpoint. Traffic passes through a physical router, load balancers, and a firewall to the indexers. Splunk forwarders auto-load-balance to splkidx.internal.door2door.com:9997 and fetch apps from splkds.internal.door2door.com:8089.]
16. Data Distribution Imbalance
Even data distribution is crucial in parallel computing
Ways to improve data distribution:
• Enable parallel pipelines on heavy forwarders (in server.conf)
• Route directly from Universal forwarders where possible
• Make the following changes to forwarders’ outputs.conf (see the sketch after this list):
• forceTimebasedAutoLB = true
• autoLBFrequency = x
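A minimal sketch of those two settings together in outputs.conf, assuming a hypothetical pair of indexers and a 10-second frequency (tune autoLBFrequency to your indexer count, as discussed below):
# outputs.conf on the forwarder (server names are placeholders)
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
forceTimebasedAutoLB = true
autoLBFrequency = 10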
Examine saved-search time windows. The example below has many searches over a 5-minute window and some searches over a 1-minute window;
autoLBFrequency times the number of indexers should divide evenly into 5 minutes, or 1 minute if possible. For example, with 6 indexers, autoLBFrequency = 10 gives 6 × 10 = 60 seconds, which divides a 5-minute window evenly.
| tstats summariesonly=t count WHERE index="*" by splunk_server _time | timechart span=5m sum(count) by splunk_server
6 indexers, autoLBFrequency = 30: uneven distribution of workload over 5-minute periods; unpredictable workload variation.
6 indexers, autoLBFrequency = 15: better distribution over 5 minutes. autoLBFrequency = 10 would be even better, as there are 6 indexers.
17. Data Imbalance - Troubleshoot
Troubleshooting:
• Validate firewall rules are in place
• Check that all forwarders have the correct outputs
• Ensure indexers are all listening on the proper port
• Does splunkd.log have anything to say?
• Use the Indexing Overview and Configuration Overview (btool saves the day – see the example below)
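For instance, a quick way to check the effective configuration with btool on a default install (the stanza filter is illustrative):
# dump the merged outputs.conf, showing which file each setting came from
$SPLUNK_HOME/bin/splunk btool outputs list --debug
# likewise, confirm the indexer's listening port from its merged inputs.conf
$SPLUNK_HOME/bin/splunk btool inputs list splunktcp --debug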
Other Causes:
• Simple misconfiguration
• Data processing queues filling up and forwarders timing out and jumping to next indexer
• Check Distributed Indexing Performance in the DMC for queue filling – a typical sign of disk performance issues
• Indexer affinity - the forwarders get stuck to one indexer because EOF never met
• forceTimebasedAutoLB can help! http://blogs.splunk.com/2014/03/18/time-based-load-balancing/
38. Distributed Deployment – Common Components
Search Head: 3 X Cisco UCS C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600GB 15K SFF SAS drives (RAID1)
Admin/Master Nodes: 2 X Cisco UCS C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2620 v3 (12 cores)
▫ Memory: 256 GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600GB 15K SFF SAS drives (RAID1)
Network Fabric: 2 X Cisco UCS 6248UP 48-port Fabric Interconnects
39. Distributed Deployment – Retention vs. Performance
High Capacity configuration:
▫ Indexers: 16 X C240-M4 rack servers, each with: CPU: 2 X E5-2680 v3 (24 cores); Memory: 256GB; Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache); Cisco VIC 1227; 24 X 1.2TB 10K SAS in RAID10; 2 X 120GB SSD in RAID1 for OS
▫ Retention capability: >1 TB/day with 1 year+ retention
▫ Indexing capacity: 4 TB/day (2 TB/day with replication)
▫ Raw index capacity: 236 TB (expected data capacity at 2:1 compression: 472 TB)
▫ Key use case: enterprises requiring larger data retention
▫ Server count: 21 (37 RU)
▫ Scalability: additional search head(s); 1 to 16 additional indexers (refer to the High Capacity indexer configuration)
High Performance configuration:
▫ Indexers: 16 X C220-M4 rack servers, each with: CPU: 2 X E5-2680 v3 (24 cores); Memory: 256GB; Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache); Cisco VIC 1227; 6 X 800GB SSD-EP in RAID5; 2 X 600GB 10K SFF SAS HDD in RAID1 for OS
▫ Retention capability: >1.25 TB/day with 90-day retention
▫ Indexing capacity: 8 TB/day (4 TB/day with replication)
▫ Raw index capacity: 64 TB (expected data capacity at 2:1 compression: 128 TB)
▫ Key use case: supporting a large number of concurrent users that require faster response times
▫ Server count: 21 (21 RU)
▫ Scalability: additional search head(s); 1 to 16 additional indexers (refer to the High Performance indexer configuration)
40. Cloud Deployments
Cloud Considerations
• Authentication restrictions
• Data transfer costs
• Security – SSL Tunnel
• Zones
• Hybrid deployments
VMware http://www.splunk.com/web_assets/pdfs/secure/Splunk_and_VMware_VMs_Tech_Brief.pdf
AWS https://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-amazon-web-services-technical-brief.pdf
Azure http://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-microsoft-azure.pdf
46. Built for 100% Uptime
• High availability across indexers & search heads
• Multiple AWS availability zones
• Dedicated Cloud environments – secure, 10x bursting
• Splunk Cloud fully monitored using Splunk Enterprise
51. More Is Better?
CPUs
• 8, 12, 16, 24, 32, etc….
• Pipelines - new 6.3 feature for parallelization! (see the server.conf sketch below)
• Indexing can handle higher bursts with multiple index pipeline sets
• Certain searches can be improved with multiple search pipeline sets
• Historical batch – return the data without worrying about time order ( … | stats count)
• Indexers still need to do the heavy lifting (search exists on indexer AND search head)
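As a hedged illustration, enabling a second ingestion pipeline set is a one-line server.conf change (the value 2 is an example; size it to your spare CPU cores):
# server.conf on an indexer or heavy forwarder – requires Splunk 6.3+
[general]
parallelIngestionPipelines = 2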
Memory
• Good for search heads and indexers (16+ GB)
• Benefits from extra RAM used by OS for caching
Disks
• Faster is better - 10k – 15k rpm strongly recommended, SSD preferred
• More disks in RAID 1+0 = Faster
• RAID 5+1 or 6 can be good for Cold buckets (see the indexes.conf sketch below)
• SSDs can also provide benefit for rare term searches and many concurrent jobs
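One way this advice plays out in practice is splitting hot/warm and cold storage with volumes in indexes.conf; the paths and size below are placeholder assumptions:
# indexes.conf – hot/warm on fast RAID 1+0 or SSD, cold on larger, slower RAID
[volume:fast]
path = /splunk_hot
maxVolumeDataSizeMB = 4000000
[volume:cold]
path = /splunk_cold
[main]
homePath = volume:fast/main/db
coldPath = volume:cold/main/colddb
# thawedPath cannot reference a volume, so it stays an explicit path
thawedPath = $SPLUNK_DB/main/thaweddb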
55. How are things, overall?
High level environment status – quick view of what’s up/down/not reporting:
• Forwarder health - finding forwarders that we haven’t seen for awhile
• Data source health - how are our data feeds doing?
• REST endpoints (| rest /services/server/info) - looking at system information, possibly spotting under-provisioned instances
Spotting warnings and errors within Splunk _internal:
• index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN) | cluster showcount=t | table cluster_count host log_level message | sort - cluster_count | rename cluster_count AS count, log_level AS level
• index=_internal sourcetype=splunkd log_level!=INFO | timechart count by component
Track resource usage:
• Say hello to _introspection (Splunk 6.1+) – see the example search below
• Captures disk and other resource metrics (by default on full installs)
• http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/Abouttheplatforminstrumentationframework
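A sketch of the kind of search you can run against it, assuming the default splunk_resource_usage sourcetype and Hostwide component (verify the field names against the docs for your version):
index=_introspection sourcetype=splunk_resource_usage component=Hostwide | timechart avg(data.cpu_system_pct) AS avg_cpu_system, avg(data.mem_used) AS avg_mem_used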
Dashboards to help save the day:
• Health Status - Splunk Health Overview
• Instance - Distributed Management Console
• Indexing Performance - Distributed Management Console
• Resource Usage - Splunk Health Overview
• License Usage - Splunk Health Overview