Big Data with Hadoop & Spark Training: http://bit.ly/2shXBpj
This CloudxLab Apache Spark - Loading & Saving Data tutorial helps you understand loading and saving data in Apache Spark in detail. Below are the topics covered in this tutorial:
1) Common Data Sources
2) Common Supported File Formats
3) Handling Text Files using Scala
4) Loading CSV
5) SequenceFiles
6) Object Files
7) Hadoop Input and Output Format - Old and New API
8) Protocol Buffers
9) File Compression
10) Handling LZO
2. Loading + Saving Data
1. So far we have
a. either converted in-memory data
b. or used files from HDFS
Loading & Saving Data
3. Loading + Saving Data
1. So far we have
a. either converted in-memory data
b. or used files from HDFS
2. Spark supports a wide variety of datasets
3. Can access data through InputFormat & OutputFormat
a. The interfaces used by Hadoop
b. Available for many common file formats and storage systems (e.g., S3, HDFS, Cassandra, HBase, etc.)
Loading & Saving Data
5. Loading + Saving Data
Common Data Sources
File formats
● Text, JSON, SequenceFiles, Protocol buffers.
● We can also configure compression
6. Loading + Saving Data
Common Data Sources
File formats
● Text, JSON, SequenceFiles, Protocol buffers.
● We can also configure compression
Stores
Filesystems
● Local, NFS, HDFS, Amazon S3
Databases and key/value stores
● For Cassandra, HBase, Elasticsearch, and JDBC databases.
7. Loading + Saving Data
Structured data sources through Spark SQL aka Data Frames
+ Efficient API for structured data sources, including JSON and Apache Hive
+ Covered later
Loading & Saving Data
8. Loading + Saving Data
Common supported file formats
+ A file could be in any format
+ If we know the format upfront, we can read and load it
+ Otherwise, we can detect it with a tool like the "file" command
9. Loading + Saving Data
Common supported file formats
+ Very common. Plain old text files. Printable chars
+ Records are assumed to be one per line.
+ Unstructured Data
Text files
Example:
10. Loading + Saving Data
Common supported file formats
+ JavaScript Object Notation
+ Common text-based format
+ Semi-structured; most libraries require one record per line.
JSON files
Example:
{
"name" : "John",
"age" : 31,
"knows" : ["C", "C++"]
}
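As a quick preview of the Spark SQL route mentioned on the earlier slide (covered in detail in the DataFrames deck), line-delimited JSON like the record above can be loaded directly into a DataFrame. A minimal sketch, assuming a Spark 2.x shell where spark is the SparkSession and a hypothetical /data/people.json file with one JSON object per line:
// Hypothetical path; each line of the file holds one JSON record
val people = spark.read.json("/data/people.json")
people.printSchema()
people.select("name", "age").show()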
11. Loading + Saving Data
+ Very common text-based format
+ Often used with spreadsheet applications.
+ Comma-Separated Values
Common supported file formats
CSV files
Example:
12. Loading + Saving Data
+ Compact Hadoop file format used for key/value data.
+ Key and values can be binary data
+ To bundle together many small files
Common supported file formats
Sequence files
See More at https://wiki.apache.org/hadoop/SequenceFile
13. Loading + Saving Data
+ A fast, space-efficient multilanguage format.
+ More compact than JSON.
Common supported file formats
Protocol buffers
See More at https://developers.google.com/protocol-buffers/
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
14. Loading + Saving Data
+ For data from one Spark job to be consumed by another
+ Uses Java serialization - breaks if you change your classes
Common supported file formats
Object Files
15. Loading + Saving Data
Loading Files
var input = sc.textFile("/data/ml-100k/u1.test")
Handling Text Files - Scala
16. Loading + Saving Data
Loading Files
var input = sc.textFile("/data/ml-100k/u1.test")
Loading Directories
var input = sc.wholeTextFiles("/data/ml-100k");
var lengths = input.mapValues(x => x.length);
lengths.collect();
[(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/mku.sh', 643),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.data', 1979173),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.genre', 202),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.info', 36) …]
Handling Text Files - Scala
17. Loading + Saving Data
Handling Text Files - Scala
Loading Files
var input = sc.textFile("/data/ml-100k/u1.test")
Loading Directories
var input = sc.wholeTextFiles("/data/ml-100k");
var lengths = input.mapValues(x => x.length);
lengths.collect();
[(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/mku.sh', 643),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.data', 1979173),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.genre', 202),
(u'hdfs://ip-172-31-53-48.ec2.internal:8020/data/ml-100k/u.info', 36) …]
Saving Files
lengths.saveAsTextFile(outputDir)
18. Loading + Saving Data
1. Records are stored one per line,
2. Fixed number of fields per line
3. Fields are separated by a comma (tab in TSV)
4. We can use row numbers to detect the header, etc. (see the header-skipping sketch after the efficient CSV example below)
Comma / Tab-Separated Values (CSV / TSV)
21. Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var a = sc.textFile("/data/spark/temps.csv");
var p = a.map(
line => {
val parser = new CSVParser(',')
parser.parseLine(line)
})
p.take(1)
//Array(Array(20, " NYC", " 2014-01-01"))
spark-shell --packages net.sf.opencsv:opencsv:2.3
Or
Add this to sbt: libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
Loading CSV - Example
https://gist.github.com/girisandeep/b721cf93981c338665c328441d419253
22. Loading + Saving Data
Loading CSV - Example Efficient
https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
23. Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var linesRdd = sc.textFile("/data/spark/temps.csv");
Loading CSV - Example Efficient
https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
24. Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var linesRdd = sc.textFile("/data/spark/temps.csv");
def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = {
val parser = new CSVParser(',')
for(line <- itr)
yield parser.parseLine(line)
}
Loading CSV - Example Efficient
https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
25. Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var linesRdd = sc.textFile("/data/spark/temps.csv");
def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = {
val parser = new CSVParser(',')
for(line <- itr)
yield parser.parseLine(line)
}
//Check with simple example
val x = parseCSV(Array("1,2,3","a,b,c").iterator)
val result = linesRdd.mapPartitions(parseCSV)
Loading CSV - Example Efficient
https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
26. Loading + Saving Data
import au.com.bytecode.opencsv.CSVParser
var linesRdd = sc.textFile("/data/spark/temps.csv");
def parseCSV(itr:Iterator[String]):Iterator[Array[String]] = {
val parser = new CSVParser(',')
for(line <- itr)
yield parser.parseLine(line)
}
//Check with simple example
val x = parseCSV(Array("1,2,3","a,b,c").iterator)
val result = linesRdd.mapPartitions(parseCSV)
result.take(1)
//Array[Array[String]] = Array(Array(20, " NYC", " 2014-01-01"))
Loading CSV - Example Efficient
https://gist.github.com/girisandeep/fddf49ef97fde429a0d3256160b257c1
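Picking up the header-detection point from the CSV/TSV slide: a common approach is to drop the first line of the first partition with mapPartitionsWithIndex. A minimal sketch, assuming the linesRdd and parseCSV defined above and that the header is the very first line of the file:
// Partition 0 starts with the header line, so drop it there only
val withoutHeader = linesRdd.mapPartitionsWithIndex(
  (idx, itr) => if (idx == 0) itr.drop(1) else itr)
val parsed = withoutHeader.mapPartitions(parseCSV)
parsed.take(1)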
27. Loading + Saving Data
Tab Separated Files
Similar to CSV:
val parser = new CSVParser('\t')
28. Loading + Saving Data
● Popular Hadoop format
○ For handling small files
○ Create InputSplits without too much transport
SequenceFiles
29. Loading + Saving Data
SequenceFiles
● Popular Hadoop format
○ For handling small files
○ Create InputSplits without too much transport
● Composed of flat files with key/value pairs.
● Has Sync markers
○ Allow seeking to a point in the file
○ Then resynchronizing with the record boundaries
○ Allows Spark to efficiently read in parallel from multiple nodes
30. Loading + Saving Data
Loading SequenceFiles
val data = sc.sequenceFile(inFile,
"org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable")
data.map(func)
…
data.saveAsSequenceFile(outputFile)
31. Loading + Saving Data
var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)))
rdd.saveAsSequenceFile("pysequencefile1")
Saving SequenceFiles - Example
32. Loading + Saving Data
var rdd = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)))
rdd.saveAsSequenceFile("pysequencefile1")
Saving SequenceFiles - Example
33. Loading + Saving Data
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
Loading SequenceFiles - Example
34. Loading + Saving Data
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
val myrdd = sc.sequenceFile(
"pysequencefile1",
classOf[Text], classOf[DoubleWritable])
Loading SequenceFiles - Example
35. Loading + Saving Data
Loading SequenceFiles - Example
import org.apache.hadoop.io.DoubleWritable
import org.apache.hadoop.io.Text
val myrdd = sc.sequenceFile(
"pysequencefile1",
classOf[Text], classOf[DoubleWritable])
val result = myrdd.map{case (x, y) => (x.toString, y.get())}
result.collect()
//Array((key1,1.0), (key2,2.0), (key3,3.0))
36. Loading + Saving Data
● Simple wrapper around SequenceFiles
● Values are written out using Java Serialization.
● Intended to be used for Spark jobs communicating with other Spark jobs
● Can also be quite slow.
Object Files
37. Loading + Saving Data
Object Files
● Saving - saveAsObjectFile() on an RDD
● Loading - objectFile() on SparkContext
● Require almost no work to save almost arbitrary objects.
● Not available in Python; use pickle files instead
● If you change the objects, old files may not be valid
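A minimal sketch of saving and re-loading an RDD as an object file from the Scala shell (the output path objectfile1 is just an illustrative name):
// Values are serialized with Java serialization
val pairs = sc.parallelize(Array(("key1", 1.0), ("key2", 2.0), ("key3", 3.0)))
pairs.saveAsObjectFile("objectfile1")
// The element type has to be supplied when loading back
val loaded = sc.objectFile[(String, Double)]("objectfile1")
loaded.collect()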
38. Loading + Saving Data
Pickle File
● Python way of handling object files
● Uses Python’s pickle serialization library
● Saving - saveAsPickleFile() on an RDD
● Loading - pickleFile() on SparkContext
● Can also be quite slow, like object files
39. Loading + Saving Data
● Access Hadoop-supported storage formats
● Many key/value stores provide Hadoop input formats
● Example providers: HBase, MongoDB
● Older: hadoopFile() / saveAsHadoopFile()
● Newer: newAPIHadoopDataset() / saveAsNewAPIHadoopDataset()
● Takes a Configuration object on which you set the Hadoop properties
Non-filesystem data sources - hadoopFile
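The Python signatures on the next slides take class names as strings; in Scala the InputFormat and Writable classes are passed as type parameters instead. A minimal sketch reading plain text through the new Hadoop API, reusing the /data/spark/temps.csv path from the CSV example:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// Keys are byte offsets into the file, values are lines of text
val hadoopRdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/data/spark/temps.csv")
val lines = hadoopRdd.map { case (_, text) => text.toString }
lines.take(2)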
40. Loading + Saving Data
Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile.
A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
Hadoop Input and Output Formats - Old API
hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)
41. Loading + Saving Data
Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile.
A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java.
Parameters:
path – path to Hadoop file
inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
keyConverter – (None by default)
valueConverter – (None by default)
conf – Hadoop configuration, passed in as a dict (None by default)
batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
Hadoop Input and Output Formats - New API
newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)
42. Loading + Saving Data
See https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
# set up parameters for reading from MongoDB via Hadoop input format
config = {"mongo.input.uri": "mongodb://localhost:27017/marketdata.minbars"}
inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"
# read the 1-minute bars from MongoDB into Spark RDD format
minBarRawRDD = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName, valueClassName, None, None, config)
Hadoop Input and Output Formats - New API
Loading Data from MongoDB
43. Loading + Saving Data
● Developed at Google for internal RPCs
● Open sourced
● Structured data - fields & types of fields defined
● Fast to encode and decode (20-100x faster than XML)
● Takes up minimal space (3-10x smaller than XML)
● Defined using a domain-specific language
● Compiler generates accessor methods in a variety of languages
● Consist of fields: optional, required, or repeated
● While parsing
○ A missing optional field => success
○ A missing required field => failure
● So, make new fields optional (remember object file failures?)
Protocol buffers
44. Loading + Saving Data
Protocol buffers - Example
package tutorial;
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}
message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}
repeated PhoneNumber phone = 4;
}
message AddressBook {
repeated Person person = 1;
}
45. Loading + Saving Data
1. Download and install protocol buffer compiler
2. pip install protobuf
3. protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto
4. Create objects
5. Convert those into protocol buffers
6. See this project
Protocol buffers - Steps
46. Loading + Saving Data
1. To Save Storage & Network Overhead
2. With most Hadoop output formats we can specify compression codecs
3. Compression should not require the whole file at once
4. Each worker can find the start of a record => splittable
5. You can configure HDP for LZO using Ambari:
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/_configure_core-sitexml_for_lzo.html
File Compression
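For example, saveAsTextFile accepts a compression codec class directly. A minimal sketch using gzip (gzip is not splittable, so prefer a splittable codec such as LZO or bzip2 when other jobs must read the output in parallel):
import org.apache.hadoop.io.compress.GzipCodec
// Each output part file will be written as .gz
val nums = sc.parallelize(1 to 100).map(_.toString)
nums.saveAsTextFile("compressed_output", classOf[GzipCodec])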
47. Loading + Saving Data
File Compression Options
Characteristics of compression:
● Splittable
● Speed
● Effectiveness on Text
● Code
48. Loading + Saving Data
File Compression Options
Format | Splittable | Speed | Effectiveness on text | Hadoop compression codec | Comments
gzip | N | Fast | High | org.apache.hadoop.io.compress.GzipCodec |
lzo | Y | Very fast | Medium | com.hadoop.compression.lzo.LzoCodec | LZO requires installation on every worker node
bzip2 | Y | Slow | Very high | org.apache.hadoop.io.compress.BZip2Codec | Uses pure Java for the splittable version
zlib | N | Slow | Medium | org.apache.hadoop.io.compress.DefaultCodec | Default compression codec for Hadoop
Snappy | N | Very fast | Low | org.apache.hadoop.io.compress.SnappyCodec | A pure Java port of Snappy exists but is not yet available in Spark/Hadoop
49. Loading + Saving Data
1. Enable LZO in Hadoop by updating the Hadoop configuration:
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/_configure_core-sitexml_for_lzo.html
2. Create data:
$ bzip2 --stdout file.bz2 | lzop -o file.lzo
3. Update spark-env.sh with
export SPARK_CLASSPATH=$SPARK_CLASSPATH:hadoop-lzo-0.4.20-SNAPSHOT.jar
In your code, use:
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
Ref: https://gist.github.com/zedar/c43cbc7ff7f98abee885
Handling LZO
51. Loading + Saving Data
1. rdd = sc.textFile("file:///home/holden/happypandas.gz")
2. The path has to be available on all nodes.
Otherwise, load it locally and distribute using sc.parallelize
Local/“Regular” FS
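A minimal sketch of the second option for a small file: read it on the driver and then distribute it with sc.parallelize (the path is a hypothetical uncompressed variant of the file above):
import scala.io.Source
// Read on the driver only - suitable for small files - then ship the lines to the cluster
val localLines = Source.fromFile("/home/holden/happypandas.txt").getLines().toList
val rdd = sc.parallelize(localLines)
rdd.count()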
52. Loading + Saving Data
1. Popular option
2. Good if nodes are inside EC2
3. Use path in all input methods (textFile, hadoopFile etc)
s3n://bucket/path-within-bucket
4. Set environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
More details: https://cloudxlab.com/blog/access-s3-files-spark/
Amazon S3
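Instead of environment variables, the keys can also be set on the SparkContext's Hadoop configuration. A minimal sketch assuming the older s3n connector, a hypothetical bucket name, and placeholder credentials:
// Placeholder credentials - never hard-code real keys in source
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val logs = sc.textFile("s3n://my-bucket/path-within-bucket/part-*")
logs.count()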
53. Loading + Saving Data
1. The Hadoop Distributed File System
2. Spark and HDFS can be collocated on the same machines
3. Spark can take advantage of this data locality to avoid network overhead
4. In all i/o methods, use path: hdfs://master:port/path
5. Use the Spark build that matches your HDFS version
HDFS
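A minimal sketch of reading from and writing back to HDFS with an explicit URI (the namenode host and port are placeholders; use your cluster's fs.defaultFS value):
// Placeholder namenode host/port
val ratings = sc.textFile("hdfs://master:8020/data/ml-100k/u.data")
ratings.filter(_.nonEmpty).saveAsTextFile("hdfs://master:8020/user/me/u_data_copy")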