A quick comparison of Hadoop and Apache Spark with a detailed introduction.
Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. They do different things.
Looking for Similar IT Services?
Write to us business@altencalsoftlabs.com
(OR)
Visit Us @ https://www.altencalsoftlabs.com/
Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
A quick comparison of Hadoop and Apache Spark with a detailed introduction.
Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. They do different things.
Looking for Similar IT Services?
Write to us business@altencalsoftlabs.com
(OR)
Visit Us @ https://www.altencalsoftlabs.com/
Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Deep Dive: Memory Management in Apache SparkDatabricks
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
In Big Data field, Spark SQL is important data processing module for Apache Spark to work with structured row-based data in a majority of operators. Field-programmable gate array(FPGA) with highly customized intellectual property(IP) can not only bring better performance but also lower power consumption to accelerate CPU-intensive segments for an application.
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
This slide deck is used as an introduction to the internals of Apache Spark, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
Slides cover Spark core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation and also describes architecture and main components of Spark Driver. The workshop part covers Spark execution modes , provides link to github repo which contains Spark Applications examples and dockerized Hadoop environment to experiment with
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems of understanding and information within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implements graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, we’ll discuss the work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations of connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; see its performance in the context of other popular graph libraries; and hear about real-world applications.
This presentation on Spark Architecture will give an idea of what is Apache Spark, the essential features in Spark, the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Graphx. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark Architecture.
YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
This session covers how to work with PySpark interface to develop Spark applications. From loading, ingesting, and applying transformation on the data. The session covers how to work with different data sources of data, apply transformation, python best practices in developing Spark Apps. The demo covers integrating Apache Spark apps, In memory processing capabilities, working with notebooks, and integrating analytics tools into Spark Applications.
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
Deep Dive: Memory Management in Apache SparkDatabricks
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
In Big Data field, Spark SQL is important data processing module for Apache Spark to work with structured row-based data in a majority of operators. Field-programmable gate array(FPGA) with highly customized intellectual property(IP) can not only bring better performance but also lower power consumption to accelerate CPU-intensive segments for an application.
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
This slide deck is used as an introduction to the internals of Apache Spark, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
Slides cover Spark core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation and also describes architecture and main components of Spark Driver. The workshop part covers Spark execution modes , provides link to github repo which contains Spark Applications examples and dockerized Hadoop environment to experiment with
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems of understanding and information within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implements graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, we’ll discuss the work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations of connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; see its performance in the context of other popular graph libraries; and hear about real-world applications.
This presentation on Spark Architecture will give an idea of what is Apache Spark, the essential features in Spark, the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Graphx. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark Architecture.
YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
This session covers how to work with PySpark interface to develop Spark applications. From loading, ingesting, and applying transformation on the data. The session covers how to work with different data sources of data, apply transformation, python best practices in developing Spark Apps. The demo covers integrating Apache Spark apps, In memory processing capabilities, working with notebooks, and integrating analytics tools into Spark Applications.
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
Agriculture in developing countries must undergo a significant transformation in order to meet the related challenges of achieving food security and responding to climate change. Projections based on population growth and food consumption patterns indicate that agricultural production will need to increase by at least 70 percent to meet demands by 2050. Most estimates also indicate that climate change is likely to reduce agricultural productivity, production stability and incomes in some areas that already have high levels of food insecurity. Developing climate-smart agriculture is thus crucial to achieving future food security and climate change goals. This seminar describe an approach to deal with the above issue viz. Climate Smart Agriculture (CSA) and also examines some of the key technical, institutional, policy and financial responses required to achieve this transformation. Building on cases from the field, the seminar try to outlines a range of practices, approaches and tools aimed at increase the resilience and productivity of agricultural product systems, while also reducing and removing emissions. A part of the seminar elaborates institutional and policy options available to promote the transition to climate-smart agriculture at the smallholder level. Finally, the paper considers current gaps and makes innovative suggestion regarding the combined use of different sources, financing mechanism and delivery systems.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Sparkstreaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.
This talk will give an introduction the Spark stack, explain how Spark has lighting fast results, and how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
This Edureka k-means clustering algorithm tutorial will take you through the machine learning introduction, cluster analysis, types of clustering algorithms, k-means clustering, how it works along with an example/ demo in R. This Data Science with R tutorial is ideal for beginners to learn how k-means clustering work. You can also read the blog here: https://goo.gl/3aseSs
These slides were designed for Apache Hadoop + Apache Apex workshop (University program).
Audience was mainly from third year engineering students from Computer, IT, Electronics and telecom disciplines.
I tried to keep it simple for beginners to understand. Some of the examples are using context from India. But, in general this would be good starting point for the beginners.
Advanced users/experts may not find this relevant.
What Is Salesforce CRM? | Salesforce CRM Tutorial For Beginners | Salesforce ...Edureka!
This Salesforce CRM tutorial will take you through what is Salesforce CRM, benefits of Salesforce CRM, how CRM works, along with Salesforce CRM demo and use case. This Salesforce training slides is ideal for beginners to learn CRM.
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...DataStax
Worried that you aren't taking full advantage of your Spark and Cassandra integration? Well worry no more! In this talk we'll take a deep dive into all of the available configuration options and see how they affect Cassandra and Spark performance. Concerned about throughput? Learn to adjust batching parameters and gain a boost in speed. Always running out of memory? We'll take a look at the various causes of OOM errors and how we can circumvent them. Want to take advantage of Cassandra's natural partitioning in Spark? Find out about the recent developments that let you perform shuffle-less joins on Cassandra-partitioned data! Come with your questions and problems and leave with answers and solutions!
About the Speaker
Russell Spitzer Software Engineer, DataStax
Russell Spitzer received a Ph.D in Bio-Informatics before finding his deep passion for distributed software. He found the perfect outlet for this passion at DataStax where he began on the Automation and Test Engineering team. He recently moved from finding bugs to making bugs as part of the Analytics team where he works on integration between Cassandra and Spark as well as other tools.
In these slides we analyze why the aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open source implementation Hadoop. We consider how MapReduce jobs are written and executed by Hadoop.
Finally we introduce spark using a docker image and we show how to use anonymous function in spark.
The topics of the next slides will be
- Spark Shell (Scala, Python)
- Shark Shell
- Data Frames
- Spark Streaming
- Code Examples: Data Processing and Machine Learning
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the MapReduce model of Hadoop to efficiently use it for more types of computations, which includes interactive queries and stream processing. This slide shares some basic knowledge about Apache Spark.
Apache Spark: The Analytics Operating SystemAdarsh Pannu
This presentation was delivered by Adarsh Pannu at IBM's Insight Conference in Nov 2015. For a recording, visit: https://www.youtube.com/watch?v=Tbm7HIlmwJQ
The presentation provides an overview of Apache Spark, a general-purpose big data processing engine built around speed, ease of use and sophisticated analytics. It enumerates the benefits of incorporating Spark in the enterprise, including how it allows developers to write fully-featured distributed applications ranging from traditional data processing pipelines to complex machine learning. The presentation uses the Airline "On Time" data set to explore various components of the Spark stack.
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
2. About
Me
• Diverse
roles/languages
and
pla=orms.
• Middleware
space
in
recent
years.
• Worked
for
IBM/Grid
Dynamics/GigaSpaces.
• Working
as
Systems
Engineer
for
Cloudera
since
last
July.
• Work
with
and
educate
clients/prospects.
2
6. A
brief
review
of
MapReduce
Map
Map
Map
Map
Map
Map
Map
Map
Map
Map
Map
Map
Reduce
Reduce
Reduce
Reduce
Key
advances
by
MapReduce:
• Data
Locality:
AutomaLc
split
computaLon
and
launch
of
mappers
appropriately
• Fault
tolerance:
Write
intermediate
results
and
restartable
mappers
means
ability
to
run
on
commodity
hardware
• Linear
scalability:
CombinaLon
of
locality
+
programming
model
that
forces
developers
to
write
generally
scalable
soluLons
to
problems
6
7. MapReduce
sufficient
for
many
classes
of
problems
MapReduce
Hive
Pig
Mahout
Crunch
Solr
A
bit
like
Haiku:
• Limited
expressivity
• But
can
be
used
to
approach
diverse
problem
domains
7
8. BUT…
Can
we
do
beUer?
Areas
ripe
for
improvement,
• Launching
Mappers/Reducers
takes
Lme
• Having
to
write
to
disk
(replicated)
between
each
step
• Reading
data
back
from
disk
in
the
next
step
• Each
Map/Reduce
step
has
to
go
back
into
the
queue
and
get
its
resources
• Not
In
Memory
• Cannot
iterate
fast
8
9. What
is
Spark?
Spark
is
a
general
purpose
computaLonal
framework
-‐
more
flexibility
than
MapReduce.
It
is
an
implementaLon
of
a
2010
Berkley
paper
[1].
Key
properBes:
• Leverages
distributed
memory
• Full
Directed
Graph
expressions
for
data
parallel
computaLons
• Improved
developer
experience
Yet
retains:
Linear
scalability,
Fault-‐tolerance
and
Data
Locality
based
computaLons
1
-‐
hUp://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
9
10. Spark:
Easy
and
Fast
Big
Data
• Easy
to
Develop
– Rich
APIs
in
Java,
Scala,
Python
– InteracLve
shell
• Fast
to
Run
– General
execuLon
graphs
– In-‐memory
storage
2-‐5×
less
code
Up
to
10×
faster
on
disk,
100×
in
memory
10
11. Easy:
Get
Started
Immediately
• MulL-‐language
support
• InteracLve
Shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: “ERROR” in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains(“ERROR”)).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
Boolean call(String s) {
return s.contains(“error”);
}
}).count();
11
18. Driver
and
Workers
Driver
Worker
Worker
Worker
Data
RAM
Data
RAM
Data
RAM
18
19. RDD
–
Resilient
Distributed
Dataset
• Read-‐only
parLLoned
collecLon
of
records
• Created
through:
– TransformaLon
of
data
in
storage
– TransformaLon
of
RDDs
• Contains
lineage
to
compute
from
storage
• Lazy
materializaLon
• Users
control
persistence
and
parLLoning
19
21. OperaLons
• TransformaBons
create
new
RDD
from
an
exisLng
one
• AcBons
run
computaLon
on
RDD
and
return
a
value
• TransformaLons
are
lazy.
• AcLons
materialize
RDDs
by
compuLng
transformaLons.
• RDDs
can
be
cached
to
avoid
re-‐compuLng.
21
22. Fault
Tolerance
• RDDs
contain
lineage.
• Lineage
–
source
locaLon
and
list
of
transformaLons
• Lost
parLLons
can
be
re-‐computed
from
source
data
msgs = textFile.filter(lambda s: s.startsWith(“ERROR”))
.map(lambda s: s.split(“t”)[2])
HDFS
File
Filtered
RDD
Mapped
RDD
filter
(func
=
startsWith(…))
map
(func
=
split(...))
22
23. Caching
• Persist()
and
cache()
mark
data
• RDD
is
cached
ater
first
acLon
• Fault
tolerant
–
lost
parLLons
will
re-‐compute
• If
not
enough
memory
–
some
parLLons
will
not
be
cached
• Future
acLons
are
performed
on
cached
parLLoned
• So
they
are
much
faster
Use
caching
for
iteraBve
algorithms
23
31. LogisLc
Regression
• Read
two
sets
of
points
• Looks
for
a
plane
W
that
separates
them
• Perform
gradient
descent:
– Start
with
random
W
– On
each
iteraLon,
sum
a
funcLon
of
W
over
the
data
– Move
W
in
a
direcLon
that
improves
it
31
33. LogisLc
Regression
val points =
spark.textFile(…).map(parsePoint).cache()!
!
val w = Vector.random(D)!
!
for (I <- 1 to ITERATIONS) {!
"val gradient = points.map(p => !
" "(1/(1+exp(-p.t*(w dot p.x))) -1 *p.y *p.x )!
" ".reduce(_+_)!
"w -= gradient!
}!
println(“Final separating plane: ” + w)!
33
34. Conviva
Use-‐Case
[1]
• Monitor
online
video
consumpLon
• Analyze
trends
Need
to
run
tens
of
queries
like
this
a
day:
SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;
1
-‐
hUp://www.conviva.com/using-‐spark-‐and-‐hive-‐to-‐process-‐bigdata-‐at-‐conviva/
34
35. Conviva
With
Spark
val
sessions
=
sparkContext.sequenceFile[SessionSummary,NullWritable]
(pathToSessionSummaryOnHdfs)
val
cachedSessions
=
sessions.filter(whereCondiLonToFilterSessions).cache
val
mapFn
:
SessionSummary
=>
(String,
Long)
=
{
s
=>
(s.videoName,
1)
}
val
reduceFn
:
(Long,
Long)
=>
Long
=
{
(a,b)
=>
a+b
}
val
results
=
cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
35
40. Spark
Streaming
– Run
con$nuous
processing
of
data
using
Spark’s
core
API.
– Extends
Spark
concept
of
RDD’s
to
DStreams
(DiscreLzed
Streams)
which
are
fault
tolerant,
transformable
streams.
Users
can
re-‐use
exisLng
code
for
batch/offline
processing.
– Adds
“rolling
window”
operaLons.
E.g.
compute
rolling
averages
or
counts
for
data
over
last
five
minutes.
– Example
use
cases:
• “On-‐the-‐fly”
ETL
as
data
is
ingested
into
Hadoop/HDFS.
• DetecLng
anomalous
behavior
and
triggering
alerts.
• ConLnuous
reporLng
of
summary
metrics
for
incoming
data.
40
41. val
tweets
=
ssc.twitterStream()
val
hashTags
=
tweets.flatMap
(status
=>
getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
flatMap flatMap flatMap
save save save
batch
@
t+1
batch
@
t
batch
@
t+2
tweets
DStream
hashTags
DStream
Stream
composed
of
small
(1-‐10s)
batch
computaLons
“Micro-‐batch”
Architecture
41
45. Shark
Vs
Impala
• Shark
inherits
Hive
limitaLons
while
Impala
is
purpose
built
for
SQL.
• Impala
is
significantly
faster
per
our
tests.
• Shark
does
not
have
security,
audit/lineage,
support
for
high-‐concurrency,
operaLonal
tooling
for
config/monitor/reporLng/
debugging.
• InteracLve
SQL
needed
for
connecLng
BI
Tools.
Shark
not
cerLfied
by
any
BI
vendor.
45
48. Why
Spark?
• Flexible
like
MapReduce
• High
performance
• Machine
learning,
iteraLve
algorithms
• InteracLve
data
exploraLons
• Developer
producLvity
48
49. How
Spark
Works?
• RDDs
–
resilient
distributed
data
• Lazy
transformaLons
• Caching
• Fault
tolerance
by
storing
lineage
• Streams
–
micro-‐batches
of
RDDs
• Shark
–
Hive
+
Spark
49
Editor's Notes
* MapReduce struggles from performance optimization for individual systems because of its design* Google has used both techniques in-house quite a bit and the future will contain both
Spark’s storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one:If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.If you want to define your own storage level (say, with replication factor of 3 instead of 2), then use the function factor method apply() of the StorageLevel singleton object.