- Splunk uses a MapReduce model to parallelize analytics tasks and retrieve results from large datasets. It implements MapReduce on an indexed datastore using its search language.
- The Splunk search language allows users to express complex data processing pipelines without writing code. Splunk automatically converts search queries into map and reduce functions to run in parallel.
- Splunk employs both temporal and spatial MapReduce. Temporal MapReduce splits data by time within a node, while spatial MapReduce distributes processing across multiple nodes in a cluster.
Splunk and map_reduce
Splunk Inc. | 250 Brannan Street | San Francisco, CA 94107 | www.splunk.com | info@splunk.com
Copyright 2011 Splunk Inc. All rights reserved.
Splunk Technical Paper:
Large-Scale, Unstructured Data Retrieval and Analysis Using Splunk
An Easier, More Productive Way to Leverage the Proven MapReduce Paradigm
Stephen Sorkin, VP of Engineering
Background
Splunk is a general-purpose search, analysis and reporting engine for time-series text data, typically machine
data. Splunk software is deployed to address one or more core IT functions: application management, security,
compliance, IT operations management and providing analytics for the business.
Splunk versions 1 and 2 were optimized to deliver a typical Web search experience, where a search is run and the first
10 to 100 results are displayed quickly. Any results past a fixed limit were irretrievable unless additional
search criteria were added. Under these conditions Splunk scalability had its limitations. In version 3 Splunk
added a statistical capability, enhancing the ability to perform analytics. With this addition, searches could be run
against the index to deliver more context.
In response to input from Splunk users and to address the challenges of large-scale data processing, in July 2009
Splunk 4 changed this model by adding the facility to retrieve and analyze massive datasets using a "divide and
conquer" approach: the MapReduce mechanism. With version 4, Splunk optimized the execution of its search
language using the MapReduce model where parallelism is key.
Of course, parallelizing analytics via MapReduce is not unique to Splunk. However, the Splunk implementation of
MapReduce on top of an indexed datastore with its expressive search language provides a simpler, faster way to
analyze huge volumes of machine data. Beyond this, the Splunk MapReduce implementation is part of a
complete solution for collecting, storing and processing machine data in real time, thereby eliminating the need
to write or maintain code—a requisite of more generic MapReduce implementations.
This paper describes the Splunk approach to machine data processing on a large scale, based on the MapReduce
model.
What is MapReduce?
MapReduce is the model of distributed data processing introduced by Google in 2004. The fundamental concept
of MapReduce is to divide problems into two parts: a map function that processes source data into sufficient
statistics and a reduce function that merges all sufficient statistics into a final answer. By definition, any number
of concurrent map functions can be run at the same time without intercommunication. Once all the data has had
the map function applied to it, the reduce function can be run to combine the results of the map phases.
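To make this division of labor concrete, here is a minimal, illustrative Python sketch (not Splunk or Hadoop code) of a map function that turns a chunk of source data into sufficient statistics and a reduce function that merges those statistics into a final answer. The data and the statistic (a mean) are examples only.

# Illustrative only: a toy MapReduce over chunks of numeric "events".
def map_chunk(chunk):
    # Sufficient statistics for a mean: (count, sum).
    return (len(chunk), sum(chunk))

def reduce_stats(partials):
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_count if total_count else 0.0

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]        # chunks could live on different nodes
partials = [map_chunk(c) for c in chunks]          # map phase: runs independently per chunk
print(reduce_stats(partials))                      # reduce phase: prints 5.0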
For large scale batch processing and high speed data retrieval, common in Web search scenarios, MapReduce
provides the fastest, most cost-effective and most scalable mechanism for returning results. Today, most of the
leading technologies for managing "big data" are developed on MapReduce. With MapReduce there are few
scalability limitations, but leveraging it directly does require writing and maintaining a lot of code.
This paper assumes moderate knowledge of the MapReduce framework. For further reference, Jeffrey Dean and
Sanjay Ghemawat of Google have prepared an introductory paper on the subject: http://labs.google.com/papers/mapreduce.html.
Additionally, the Google and Yahoo papers on Sawzall (http://labs.google.com/papers/sawzall.html) and
Pig (http://research.yahoo.com/project/90) provide good introductions to alternative abstraction layers built on
top of MapReduce infrastructures for large-scale data processing tasks.
Speed and Scale Using a Unique Data Acquisition/Storage Method
The Splunk engine is optimized for quickly indexing and persisting unstructured data loaded into the system.
Specifically, Splunk uses a minimal schema for persisted data – events consist only of the raw event text, implied
timestamp, source (typically the filename for file-based inputs), sourcetype (an indication of the general “type” of
data) and host (where the data originated). Once data enters the Splunk system, it quickly proceeds through
processing, is persisted in its raw form and is indexed by the above fields along with all the keywords in the raw
event text.
In a properly sized deployment, data is almost always available for search and reporting seconds after it was
generated and written to disk; network inputs make this even faster. This is different from Google’s MapReduce
implementation and Hadoop, which are explicitly agnostic towards loading data into the system.
Indexing is an essential element of the canonical “super-grep” use case for Splunk, but it also makes most
retrieval tasks faster. Any more sophisticated processing on these raw events is deferred until search time. This
serves four important goals: indexing speed is increased because minimal processing is performed; bringing new data
into the system is a relatively low-effort exercise because no schema planning is needed; the original data is persisted
for easy inspection; and the system is resilient to change because data parsing problems do not require reloading or
re-indexing the data.
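As a rough illustration of the minimal schema described above, the persisted event can be pictured as follows. This Python sketch is for exposition only; the field names mirror the paper's description, not Splunk internals.

# Illustrative sketch of the minimal persisted-event schema.
from dataclasses import dataclass

@dataclass
class Event:
    raw: str          # the raw event text, persisted unmodified
    timestamp: float  # implied timestamp
    source: str       # e.g., the filename for file-based inputs
    sourcetype: str   # general "type" of data, e.g., "access_combined"
    host: str         # where the data originated

evt = Event(
    raw='127.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET / HTTP/1.1" 200 2326',
    timestamp=1318280136.0,
    source="/var/log/apache/access.log",
    sourcetype="access_combined",
    host="web01",
)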
A typical high-performance Splunk deployment involves many servers acting as indexers. As data is generated
throughout the network from servers, desktops or network devices, it is forwarded to one of the many indexers.
No strong affinity exists between forwarder and indexer, which prevents bottlenecks that arise from
interconnections in the system. Moreover, any data can be stored on any indexer, which makes it easy to balance
the load on the indexers. Although spreading the data across many indexers can be thought of as a mechanism
to increase indexing performance, through MapReduce, the big win comes during search and reporting, as will
be seen in the following sections.
Simplified Analysis with the Splunk Search Language
The Splunk Search Language is a concise way to express the processing of sparse or dense tabular data, using
the MapReduce mechanism, without having to write code or understand how to split processing between the
map and reduce phases. The main inspiration of the Splunk Search Language was Unix shells, which allow
“piping” data from one discrete process to another to easily achieve complicated data processing tasks. Within
Splunk this is exposed as a rich set of search commands, which can be formed into linear pipelines. The mental
model is that of passing a table from command to command. The complication for large-scale tasks is that the
table may have an arbitrarily large number of rows at any step of the processing.
The columns in the search-time tables are commonly referred to as fields in this document. Unlike in typical
databases, these columns do not exist in the persisted representation of the data in Splunk. Through search, the
raw data is enriched with fields. Through configuration, fields are automatically added via techniques like
automatic key/value extraction, explicit extraction through regular expressions or delimiters or through SQL
Technical Paper: Large-Scale, Unstructured Data Retrieval and Analysis Using Splunk. Page 2 of 7
Copyright 2011 Splunk Inc. All rights reserved.
join-like lookups. Fields are also constructed on the fly in search using commands like “eval” for evaluating
expressions, “rex” for running regular expression-based extractions and “lookup” for applying a lookup table.
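To make this schema-on-read idea concrete, here is a small hedged Python sketch of extracting a field from raw event text with a regular expression at search time. The pattern and field name are illustrative examples, not Splunk's actual extraction configuration.

# Illustrative only: extract an HTTP "status" field from a raw combined
# access-log event at search time (schema on read).
import re

STATUS_RE = re.compile(r'" (\d{3}) ')   # example pattern: status code after the request

def extract_fields(raw_event):
    fields = {}
    m = STATUS_RE.search(raw_event)
    if m:
        fields["status"] = m.group(1)
    return fields

raw = '127.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET / HTTP/1.1" 200 2326'
print(extract_fields(raw))   # {'status': '200'}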
Below are examples of some common and useful searches in the Splunk Search Language:
• search googlebot
This search simply retrieves all events that contain the string “googlebot.” This is very similar to running
“grep –iw googlebot” on the data set, though this case benefits from the index. This is useful in the
context of Web access logs, where a common task is understanding what common search bots are
causing hits and which resources they are hitting.
• search sourcetype=access_combined | timechart count by status
This search retrieves all data marked with the sourcetype "access_combined," which is typically
associated with web access log data. The data is then divided into uniform time-based groups. For each
time group, the count of events is calculated for every distinct HTTP "status" code. This search is
commonly used to determine the general health of a web server or service. It is especially helpful for
noticing an increase of errors, either 404s which may be associated with a new content push or 500s
which could be related to server issues.
• search sourcetype=access_combined | transaction clientip maxpause=10m
| timechart median(eventcount) perc95(eventcount)
This search again retrieves all web access log data. The events are then gathered into transactions
identified by the client IP address, where events more than 10 minutes apart are considered to be in
separate transactions. Finally, the data is summarized by time and, for each discrete time range, the
median and 95th percentile session length are calculated.
Formulating a Distributed Task Without Writing a Lot of Code
In alternative MapReduce frameworks like Google’s own, or the open source Hadoop, it is the user’s job to break
any distinct retrieval task into the map and reduce functions. For this reason, both Google and Yahoo have built
layers on top of a MapReduce infrastructure to simplify this for end users. Google’s data processing system is
called Sawzall and Yahoo’s is called Pig. A program written in either of these languages can be automatically
converted into a MapReduce task and can be run in parallel across a cluster. These layered alternatives are
certainly easier than native MapReduce frameworks, but they all require writing an extensive amount of code.
Within Splunk, any search in the Search Language is automatically converted into a parallelizable program (the
equivalent of writing code in Hadoop or scripts in Pig or Sawzall). This happens by scanning the search from
beginning to end and determining the first search command that cannot be executed in parallel. That search
command is then asked for its preferred map function. The fully-parallelizable commands combined with the
map function of the first non-parallelizable command are considered the map function for the search. All
subsequent commands are considered the reduce function for the search. We can consider two simple examples
from the world of Web Analytics to illustrate this concept:
Technical Paper: Large-Scale, Unstructured Data Retrieval and Analysis Using Splunk. Page 3 of 7
Copyright 2011 Splunk Inc. All rights reserved.
• search sourcetype=access_combined | timechart count by status
This search first retrieves all events from the sourcetype “access_combined.” Next, the events are
discretized temporally and the number of events for each distinct status code is reported for each time
period. In this case, “search sourcetype=access_combined” would be run in parallel on all nodes of the
cluster that contained suitable data. The “timechart count by status” part would be the reduce step, and
would be run on the searching node of the cluster. For efficiency, the timechart command would
automatically contribute a summarizing phase to the map step, which would be translated to “search
sourcetype=access_combined | pretimechart count by status.” This would allow the nodes of the
distributed cluster to only send sufficient statistics (as opposed to the whole events that come from the
search command). The network communications would be tables that contain tuples of the form
(timestamp, count, status).
• eventtype=pageview | eval ua=mvfilter(eventtype LIKE "ua-browser-%")
| timechart dc(clientip) by ua
Events that represent pageviews are first retrieved. Next, a field called “ua” is calculated from the
eventtype field, which contains many other event classifications. Finally, the events are discretized
temporally and the number of distinct client IPs for each browser is reported for each time period. In this
case, “search eventtype=pageview | eval ua=mvfilter(eventtype LIKE “ua-browser-%”)” would be run in
parallel on all nodes of the cluster that contained suitable data. The “timechart dc(clientip) by ua” part
would be the reduce step, and would be run on the searching node of the cluster. Like before, the
timechart command would automatically contribute a summarizing phase to the map step, which
would be translated to “search eventtype=pageview | eval ua=mvfilter(eventtype LIKE “ua-browser-%”) |
pretimechart dc(clientip) by ua.” The network communications would be tables that contain tuples of
the form (timestamp, clientip, ua, count).
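The splitting logic described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Splunk's implementation: the command classification and the "pre" summarizer names (the paper mentions pretimechart; prestats here is hypothetical) are examples.

# Illustrative sketch: scan a search pipeline left to right, stop at the first
# non-parallelizable command, and split it into map and reduce phases.
PARALLELIZABLE = {"search", "eval", "rex", "lookup"}               # streaming, per-event commands
PRE_COMMAND = {"timechart": "pretimechart", "stats": "prestats"}   # example summarizing forms

def split_pipeline(commands):
    """commands: list of strings like 'search sourcetype=access_combined'."""
    for i, cmd in enumerate(commands):
        name = cmd.split()[0]
        if name not in PARALLELIZABLE:
            # Map phase: all fully parallel commands plus the summarizing
            # "pre" form of the first non-parallelizable command.
            pre = cmd.replace(name, PRE_COMMAND.get(name, name), 1)
            return commands[:i] + [pre], commands[i:]
    return commands, []   # the whole search is parallelizable

m, r = split_pipeline(["search sourcetype=access_combined", "timechart count by status"])
# m == ['search sourcetype=access_combined', 'pretimechart count by status']  (sent to indexers)
# r == ['timechart count by status']                                          (run on the searching node)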
Temporal MapReduce – How Search Scales on a Single Node
Splunk uses a MapReduce model even when running on a single node. Instead of running the map function on
each node in the cluster, Splunk divides the data into mutually exclusive but jointly exhaustive time spans. The
map function is run independently on each of these time spans and the resulting sufficient statistics are
preserved on disk. Once the entire set of data has been analyzed, the reduce function is run on each of the sets
of sufficient statistics.
Figure 1: Temporal MapReduce.
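As a rough sketch of Temporal MapReduce on one node: events are bucketed into disjoint, jointly exhaustive time spans, the map function runs once per span, and the reduce function merges the per-span sufficient statistics. The span size and the statistic (a count by status) are illustrative assumptions.

# Illustrative sketch of Temporal MapReduce on a single node.
from collections import Counter, defaultdict

SPAN = 3600  # one-hour spans, for illustration

def map_span(events):
    # Sufficient statistic: count of events per HTTP status within the span.
    return Counter(e["status"] for e in events)

def reduce_spans(partials):
    total = Counter()
    for p in partials:
        total.update(p)
    return total

events = [
    {"time": 1000, "status": "200"},
    {"time": 2000, "status": "404"},
    {"time": 4000, "status": "200"},
]

spans = defaultdict(list)
for e in events:
    spans[e["time"] // SPAN].append(e)               # mutually exclusive, jointly exhaustive

partials = [map_span(es) for es in spans.values()]   # map phase, one run per time span
print(reduce_spans(partials))                        # Counter({'200': 2, '404': 1})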
Enhance the Efficiency of the MapReduce Scheme with Previews
To preserve the attractive interactive behavior of search even when running over very large datasets, some
enhancements are made over this basic MapReduce scheme when running searches from Splunk. One
important change is to run the reduce phase on a regular basis over the sets of sufficient statistics in order to
generate previews of the final results without having to wait for the search to complete. This is typically very
helpful in refining complicated searches by identifying errors in a search (e.g., misspellings of a field, incorrect
values specified, or improper statistical aggregation) before too much time passes and resources are wasted.
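A minimal sketch of the preview idea, under the same illustrative statistics as above: the reduce function is simply re-run over whatever per-span sufficient statistics have been produced so far, giving a partial answer long before every span has been mapped. The numbers are examples only.

# Illustrative sketch of result previews during a long-running search.
from collections import Counter

def reduce_spans(partials):
    total = Counter()
    for p in partials:
        total.update(p)
    return total

span_statistics = [Counter({"200": 40, "500": 2}),
                   Counter({"200": 35, "404": 5}),
                   Counter({"200": 50, "500": 9})]

completed = []
for stats in span_statistics:           # time spans finish one at a time
    completed.append(stats)
    preview = reduce_spans(completed)   # cheap: reduces only sufficient statistics
    print(f"preview after {len(completed)} span(s): {dict(preview)}")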
Spatial MapReduce – How Search Scales on Additional Nodes
Splunk also provides a facility, like typical MapReduce, to speed computation by distributing processing
throughout a cluster of many machines. In Splunk this is called “Distributed Search.” After a search has been
formulated into the map and reduce functions, network connections are established to each Splunk Indexer in
the search cluster. The map function is sent to each of these Splunk instances and each begins processing data
using the Temporal MapReduce scheme. As data streams back to the instance that initiated the search, it is
persisted to disk for the reduce function.
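The spatial dimension can be sketched the same way: each indexer runs the map function over its local data in parallel, and the instance that initiated the search reduces the returned sufficient statistics. In this illustrative Python sketch a thread pool stands in for the network, and the indexer contents are made-up examples.

# Illustrative sketch of Distributed Search.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

indexers = {
    "idx1": [{"status": "200"}, {"status": "404"}],
    "idx2": [{"status": "200"}, {"status": "200"}],
    "idx3": [{"status": "500"}],
}

def map_on_indexer(events):
    # Runs on each indexer; in Splunk this would itself use Temporal MapReduce.
    return Counter(e["status"] for e in events)

def reduce_on_search_head(partials):
    total = Counter()
    for p in partials:
        total.update(p)
    return total

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_on_indexer, indexers.values()))

print(reduce_on_search_head(partials))   # Counter({'200': 3, '404': 1, '500': 1})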
In general, for pure reporting searches that have a map function that compresses data for network transport, reporting speed is linear with the number of index servers in the cluster. Some searches, however, don’t have such a map function. For example, with a search for “*”, all the data indexed by every server must be returned to the searching instance for accumulation, and that instance can become the bottleneck. There are some pure searches that benefit dramatically from the distribution of work. Searches that are dominated by I/O costs (needle-in-a-haystack type searches) and those that are dominated by decompression costs (retrieving many records well distributed in time and not dense in arrival) both benefit dramatically from Distributed Search. In the first case the effective I/O bandwidth of the cluster scales linearly with the number of machines in the cluster; in the second case the CPU resources scale linearly.

For the majority of use cases (i.e., needle-in-a-haystack troubleshooting or reporting and statistical aggregation), scalability is only limited by the number of indexers dedicated to the processing task.
Figure 2: Distributed Search in Splunk.
Achieving Additional Scalability Through Summary Indexing
The problem of analyzing massive amounts of data was addressed in earlier versions of Splunk using a technique
called Summary Indexing. The general idea is that relatively small blocks of raw data (datasets between five
minutes and an hour in length) are processed on a regular basis and the result of this processing is persisted in a
time-series index similar to the one holding the original raw data. This can be immediately seen as the map function of
MapReduce. Later, when questions are asked of the data, these sufficient statistics are retrieved and reduced into
the answer.
This technique is still valuable in Splunk 4 if common questions are repeatedly asked. By running this
preprocessing over the data on a regular basis (typically every hour or every day), the sufficient statistics are
always ready for common reports.
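Sketched in the same illustrative Python terms as above: a scheduled job maps each small block of raw data into sufficient statistics and persists them, and a later report only has to reduce over the stored summaries. The in-memory list standing in for the persisted summary index is an assumption for exposition.

# Illustrative sketch of summary indexing.
from collections import Counter

summary_index = []   # stands in for the persisted summary index

def scheduled_summary_job(block_of_events):
    # Runs, say, every hour over the last hour of raw data (the map function).
    stats = Counter(e["status"] for e in block_of_events)
    summary_index.append(stats)

def report_from_summaries():
    # Answering a common question later only requires the reduce function.
    total = Counter()
    for stats in summary_index:
        total.update(stats)
    return total

scheduled_summary_job([{"status": "200"}, {"status": "404"}])
scheduled_summary_job([{"status": "200"}])
print(report_from_summaries())   # Counter({'200': 2, '404': 1})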
Comparing Languages: Sawzall and the Splunk Search Language
A key benefit of the Splunk Search Language is the simplicity it provides for common tasks. For illustration, we
will consider the first example from the Sawzall paper. This example computes the number of records, sum of the
values and sum of the squares of the values.
An example of the Sawzall language follows:
count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;
x: float = input;
emit count <- 1;
emit total <- x;
emit sum_of_squares <- x * x;
Achieving this same task using the Splunk Search Language is far more efficient:
source=<filename> | stats count sum(_raw) sumsq(_raw)
Note: Published docs from Google tout Sawzall as being a far simpler language than using a MapReduce framework (like Hadoop) directly.
According to Google, typical Sawzall programs are 10-20 times shorter than equivalent MapReduce programs in C++.
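For comparison, the same computation can be written as a hedged Python sketch of map and reduce over sufficient statistics, which is roughly what both the Sawzall program and the Splunk search above express; the chunking and input values are illustrative.

# Illustrative only: record count, sum and sum of squares as sufficient
# statistics that combine across chunks.
def map_chunk(values):
    return (len(values), sum(values), sum(v * v for v in values))

def reduce_stats(partials):
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    sum_of_squares = sum(p[2] for p in partials)
    return count, total, sum_of_squares

partials = [map_chunk([1.0, 2.0]), map_chunk([3.0, 4.0, 5.0])]
print(reduce_stats(partials))   # (5, 15.0, 55.0)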
The efficiency and ease with which Splunk delivers scalability over massive datasets is not confined to this
comparison. The contrast becomes sharper if one must properly extract value(s) from each line of the file. Splunk
provides field extractions from files by delimiters or regular expressions, either automatically for all files of a
certain type or on a search-by-search basis. Additionally, Splunk enables users to learn regular expressions
interactively based on examples from within a data set, so the user and product build knowledge over time.
Conclusion – Out-of-the-box Scalability for Big Data with Splunk
When it comes to processing massive datasets, MapReduce is the proven performance and
scalability model for parallel processing executed over large clusters of commodity hardware. Around
MapReduce, several languages and frameworks have evolved: Google Sawzall, Yahoo! Pig, the open source
Hadoop framework and Splunk. While MapReduce is an essential element to scaling the capabilities of search
and reporting in Splunk, the out-of-the-box benefits of using Splunk for large-scale data retrieval extend beyond
MapReduce processing.
Unlike other MapReduce languages and frameworks that require custom scripts or code for every new task,
Splunk utilizes its search language to automate complex processing. The Splunk Search Language makes
challenging data analysis tasks easier without requiring the user to control how it scales. With its MapReduce
underpinnings, the scalability of Splunk is only limited by the amount of resources dedicated to running Splunk
indexers.
Beyond the simplicity of the search language, Splunk provides a universal indexing capability to automate data
access and loading. With numerous mechanisms for loading data, none of which require developing or
maintaining code, Splunk users are productive quickly.
The speed and efficiency of Splunk are also enhanced through multiple methods of distributing the index
(storage) to disk. With the ability to distribute processing across time spans on a single node or across
multiple nodes, and to preview a search in progress, Splunk provides flexible alternatives for optimizing scalability.
Analyzing large datasets is simplified with the Splunk user interface. The Splunk UI, enhanced over several
product releases, is highly domain appropriate for the troubleshooting, on-the-fly reporting and dashboard
creation associated with large-scale machine data analysis.
Download a Free version of Splunk: http://www.splunk.com/download
For more information on Splunk please visit: http://www.splunk.com
###