Spark provides tools for distributed processing of large datasets across clusters. It is built around Resilient Distributed Datasets (RDDs) and the transformations and actions that can be applied to those datasets in parallel. Key features of Spark include the Spark shell for interactive use, DataFrames for structured data processing, and Spark Streaming for real-time data analysis.
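As a quick illustration of these pieces, here is a minimal, hedged sketch of RDD transformations and actions as they might be typed into the Scala spark-shell (which predefines the SparkContext as sc); the values are purely illustrative:
val nums = sc.parallelize(1 to 100)        // distribute a local collection as an RDD
val evens = nums.filter(_ % 2 == 0)        // transformation: recorded lazily
val squares = evens.map(n => n * n)        // another transformation
println(squares.count())                   // action: triggers the actual computation
println(squares.take(5).mkString(", "))    // action: pull a few results to the driver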
Big data, just an introduction to Hadoop and Scripting Languages (Corley S.r.l.)
This document provides an introduction to Big Data and Apache Hadoop. It defines Big Data as large and complex datasets that are difficult to process using traditional database tools. It describes how Hadoop uses MapReduce and HDFS to provide scalable storage and parallel processing of Big Data. It provides examples of companies using Hadoop to analyze exabytes of data and common Hadoop use cases like log analysis. Finally, it summarizes some popular Hadoop ecosystem projects like Hive, Pig, and Zookeeper that provide SQL-like querying, data flows, and coordination.
Productionizing Spark and the Spark Job Server (Evan Chan)
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
A tutorial presentation based on spark.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as a Teaching Assistant for Dr. Amir H. Payberah's Cloud Computing course in the spring semester of 2015.
This document discusses how to set up HBase with Docker in three configurations: single-node standalone, pseudo-distributed on a single machine, and fully-distributed cluster. It describes features of HBase such as consistent reads/writes, automatic sharding, and failover. It provides instructions for installing HBase on a single node using Docker, including building an image and running it with ports exposed. It also covers running HBase in pseudo-distributed mode, with the processes running as separate containers, and interacting with the HBase shell.
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2spQIBA
This CloudxLab Introduction to Apache Spark tutorial helps you to understand Spark in detail. Below are the topics covered in this tutorial:
1) Spark Architecture
2) Why Apache Spark?
3) Shortcomings of MapReduce
4) Downloading Apache Spark
5) Starting Spark With Scala Interactive Shell
6) Starting Spark With Python Interactive Shell
7) Getting started with spark-submit
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
In these slides we analyze why aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open-source implementation, Hadoop, and consider how MapReduce jobs are written and executed by Hadoop.
Finally, we introduce Spark using a Docker image and show how to use anonymous functions in Spark.
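As a small, hedged sketch of the anonymous-function styles that Spark's Scala API accepts (the input file name is an assumption for illustration):
val lines = sc.textFile("data.txt")
val upper1 = lines.map(line => line.toUpperCase)    // explicit anonymous function
val upper2 = lines.map(_.toUpperCase)               // underscore shorthand
def longEnough(line: String): Boolean = line.length > 10
val longLines = lines.filter(longEnough)            // a named function passed as a value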
The topics of the next slides will be
- Spark Shell (Scala, Python)
- Shark Shell
- Data Frames
- Spark Streaming
- Code Examples: Data Processing and Machine Learning
Hands-on Session on Big Data processing using Apache Spark and Hadoop Distributed File System
This is the first session in the series of "Apache Spark Hands-on"
Topics Covered
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformation
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
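To make the in-memory caching point concrete, here is a hedged spark-shell sketch in which an RDD is cached so that repeated actions, as in iterative algorithms or interactive queries, do not re-read the input (the file name is an assumption):
val logs = sc.textFile("access.log")
val errors = logs.filter(_.contains("ERROR")).cache()   // keep partitions in memory after first use
println(errors.count())                                  // first action: reads the file and fills the cache
println(errors.filter(_.contains("timeout")).count())    // subsequent actions reuse the cached data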
The document discusses Spark exceptions and errors related to shuffling data between nodes. It notes that tasks can fail due to out of memory errors or files being closed prematurely. It also provides explanations of Spark's shuffle operations and how data is written and merged across nodes during shuffles.
This document provides instructions for setting up an Apache Hadoop cluster on Mac OS X. It describes installing and configuring Java, Hadoop, Hive, and MySQL on a "namenode" machine and multiple "datanode" machines. Key steps include installing software via Homebrew, configuring host files and SSH keys for passwordless login, creating configuration files for core Hadoop components and copying them to all datanodes, and installing scripts to help manage the cluster. The goal is to have a basic functioning Hadoop cluster on Mac OS X for testing and proof-of-concept purposes.
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab Running in a Cluster tutorial helps you understand running Spark on a cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
"In this session, Twitter engineer Alex Payne will explore how the popular social messaging service builds scalable, distributed systems in the Scala programming language. Since 2008, Twitter has moved the development of its most critical systems to Scala, which blends object-oriented and functional programming with the power, robust tooling, and vast library support of the Java Virtual Machine. Find out how to use the Scala components that Twitter has open sourced, and learn the patterns they employ for developing core infrastructure components in this exciting and increasingly popular language."
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
This document provides an overview of Apache Spark, including its core concepts, transformations and actions, persistence, parallelism, and examples. Spark is introduced as a fast and general engine for large-scale data processing, with advantages like in-memory computing, fault tolerance, and rich APIs. Key concepts covered include its resilient distributed datasets (RDDs) and lazy evaluation approach. The document also discusses Spark SQL, streaming, and integration with other tools.
The document discusses a course on data analytics that teaches Apache Hadoop. The course objectives are to develop scalable systems using Apache Hadoop, write MapReduce applications, differentiate SQL and NoSQL, and develop big data solutions using Hive and Pig. It covers Hadoop components like HDFS and YARN, and the Hadoop ecosystem, including tools like Sqoop, Hive, Pig, Flume, and Zookeeper. Issues with relational databases for big data are also discussed, as well as the need for Hadoop's distributed storage and processing.
An over-ambitious introduction to Spark programming, testing and deployment. This slide deck tries to cover most of the core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to cause transparent background color not being rendered properly. This has been fixed in a recent upload.
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
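The word count example referred to above can be sketched in Scala roughly as follows (the input path is an assumption):
val text = sc.textFile("input.txt")
val words = text.flatMap(_.split("\\s+"))   // split each line into words
val pairs = words.map(word => (word, 1))    // pair each word with a count of 1
val counts = pairs.reduceByKey(_ + _)       // sum the counts per word
counts.take(10).foreach(println)            // action: print a sample of the result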
This document provides information about integrating Apache Solr and Apache Spark. It discusses using Solr as a data source and sink for Spark applications, including indexing data from Spark jobs into Solr in real-time and exposing Solr query results as Spark RDDs. The document also summarizes the Spark Streaming and RDD APIs and provides code examples for indexing tweets from Spark Streaming into Solr and reading from Solr into a DataFrame.
The document discusses Spark job failures and Spark/YARN architecture. It describes a Spark job failure due to a task failing 4 times with a NumberFormatException when parsing a string. It then explains that Spark jobs are divided into stages made up of tasks, and the entire job fails if a stage fails. The document also provides an overview of the Spark and YARN architectures, showing how Spark jobs are submitted to and run via the YARN resource manager.
Apache Sqoop: Unlocking Hadoop for Your Relational Database (huguk)
Kathleen Ting, Technical Account Manager @ Cloudera and Sqoop Committer
Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we'll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling (see the sketch after this list).
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
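As an illustration of the narrow/wide distinction and of lineage tracking mentioned above, a hedged spark-shell sketch (the input path is an assumption):
val pairs = sc.textFile("events.txt").map(line => (line.take(1), 1))   // map: narrow, no shuffle needed
val counts = pairs.reduceByKey(_ + _)                                  // wide: requires a shuffle
println(counts.toDebugString)                                          // print the lineage (dependency) graph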
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
This document provides an overview of Spark and its key components. Spark is a fast and general engine for large-scale data processing. It uses Resilient Distributed Datasets (RDDs) that allow data to be partitioned across clusters and cached in memory for fast performance. Spark is up to 100x faster than Hadoop for iterative jobs and provides a unified framework for batch processing, streaming, SQL, and machine learning workloads.
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
The slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, how stages of tasks are formed, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo that contains example Spark applications and a dockerized Hadoop environment to experiment with.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 in order to create better, more scalable big data applications. (A minimal Dataset API and Spark SQL sketch follows the contents list below.)
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
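For orientation, here is a minimal, hedged sketch of the Spark 2 entry point, the typed Dataset API, and Spark SQL; the case class, file path, and values are assumptions:
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("spark2-sketch").getOrCreate()
import spark.implicits._

val people = spark.read.json("people.json").as[Person]      // a typed Dataset[Person]
people.filter(_.age > 21).show()                             // Dataset API with a plain Scala predicate
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()   // the same query via Spark SQL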
Apache Spark - Google Pittsburgh, Aug 25th (Sneha Challa)
The document is a presentation about Apache Spark given on August 25th, 2015 in Pittsburgh by Sneha Challa. It introduces Spark as a fast and general cluster computing engine for large-scale data processing. It discusses Spark's Resilient Distributed Datasets (RDDs) and transformations/actions. It provides examples of Spark APIs like map, reduce, and explains running Spark on standalone, Mesos, YARN, or EC2 clusters. It also covers Spark libraries like MLlib and running machine learning algorithms like k-means clustering and logistic regression.
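Since the talk mentions k-means with MLlib, here is a hedged sketch against the RDD-based MLlib API; the data points and parameters are illustrative:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)))
val model = KMeans.train(points, 2, 20)          // fit 2 clusters, at most 20 iterations
model.clusterCenters.foreach(println)            // inspect the learned centroids
println(model.predict(Vectors.dense(8.5, 9.0)))  // assign a new point to a cluster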
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Spark real world use cases and optimizations (Gal Marder)
This document provides an overview of Spark, its core abstraction of resilient distributed datasets (RDDs), and common transformations and actions. It discusses how Spark partitions and distributes data across a cluster, its lazy evaluation model, and the concept of dependencies between RDDs. Common use cases like word counting, bucketing user data, finding top results, and analytics reporting are demonstrated. Key topics covered include avoiding expensive shuffle operations, choosing optimal aggregation methods, and potentially caching data in memory.
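One concrete instance of "avoiding expensive shuffle operations" and "choosing optimal aggregation methods" is preferring reduceByKey over groupByKey for simple aggregations; a hedged sketch, where the input path and format are assumptions:
val pairs = sc.textFile("clicks.csv").map(line => (line.split(",")(0), 1))
// groupByKey ships every (key, 1) record across the network before counting:
val countsSlow = pairs.groupByKey().mapValues(_.sum)
// reduceByKey pre-aggregates within each partition, so far less data is shuffled:
val countsFast = pairs.reduceByKey(_ + _)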
Spark is a fast and general engine for large-scale data processing. It was designed to be fast, easy to use and supports machine learning. Spark achieves high performance by keeping data in-memory as much as possible using its Resilient Distributed Datasets (RDDs) abstraction. RDDs allow data to be partitioned across nodes and operations are performed in parallel. The Spark architecture uses a master-slave model with a driver program coordinating execution across worker nodes. Transformations operate on RDDs to produce new RDDs while actions trigger job execution and return results.
The document discusses Spark, an open-source cluster computing framework. It describes Spark's Resilient Distributed Dataset (RDD) as an immutable and partitioned collection that can automatically recover from node failures. RDDs can be created from data sources like files or existing collections. Transformations create new RDDs from existing ones lazily, while actions return values to the driver program. Spark supports operations like WordCount through transformations like flatMap and reduceByKey. It uses stages and shuffling to distribute operations across a cluster in a fault-tolerant manner. Spark Streaming processes live data streams by dividing them into batches treated as RDDs. Spark SQL allows querying data through SQL on DataFrames.
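To illustrate the "live streams divided into batches treated as RDDs" model, here is a minimal, hedged Spark Streaming sketch; the host, port, and batch interval are assumptions:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))           // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)      // a DStream of text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                            // print a sample of every batch
ssc.start()                                               // start receiving and processing
ssc.awaitTermination()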
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc... (Lucidworks)
This document provides an overview of using Apache Spark with Apache Solr. It discusses using Solr as a data source for Spark SQL, reading data from Solr into Spark RDDs, querying Solr from the Spark shell, indexing data from Spark Streaming into Solr, and an example of using Solr as a sink for a Spark Streaming application that processes tweets in real-time.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This document provides an overview of Apache Spark and machine learning using Spark. It introduces the speaker and objectives. It then covers Spark concepts including its architecture, RDDs, transformations and actions. It demonstrates working with RDDs and DataFrames. Finally, it discusses machine learning libraries available in Spark like MLib and how Spark can be used for supervised machine learning tasks.
Spark is an open-source cluster computing framework. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations like map and filter that return new RDDs and actions that return values to the driver program.
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two RDDs:
val joined = readmeCounts.join(changesCounts)
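4. Since join is a transformation and therefore lazy, a possible final step (not part of the original assignment text) is to trigger an action to materialize the result:
joined.take(10).foreach(println)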
Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.
Apache Spark is a fast and general engine for large-scale data processing. It was originally developed in 2009 and is now supported by Databricks. Spark provides APIs in Java, Scala, Python and can run on Hadoop, Mesos, standalone or in the cloud. It provides high-level APIs like Spark SQL, MLlib, GraphX and Spark Streaming for structured data processing, machine learning, graph analytics and stream processing.
This document provides an overview of Apache Spark, an open-source cluster computing framework. It discusses Spark's history and community growth. Key aspects covered include Resilient Distributed Datasets (RDDs) which allow transformations like map and filter, fault tolerance through lineage tracking, and caching data in memory or disk. Example applications demonstrated include log mining, machine learning algorithms, and Spark's libraries for SQL, streaming, and machine learning.
Spark Summit East 2015 Advanced Devops Student Slides (Databricks)
This document provides an agenda for an advanced Spark class covering topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, shuffle operations, and Spark Streaming. The class will be held in March 2015 and include lectures, labs, and Q&A sessions. It notes that some slides may be skipped and asks attendees to keep Q&A low during the class, with a dedicated Q&A period at the end.
In this lecture we analyze graph-oriented databases. In particular, we consider TitanDB as a graph database. We analyze how to query it using Gremlin and how to create edges and vertices.
Finally, we present how to use Rexster to visualize the stored graph.
In this lecture we analyze document-oriented databases. In particular, we consider why they were among the first approaches to NoSQL and what their main features are. Then, we analyze MongoDB as an example: its data model, CRUD operations, write concerns, and scaling (replication and sharding).
Finally, we present other document-oriented databases and discuss when to use document-oriented databases and when not to.
The document discusses cloning Twitter using HBase. It describes some key features of Twitter like allowing users to post status updates, follow other users, mention users, and re-tweet posts. It then provides an overview of HBase including its features like consistency, automatic sharding and failover. It discusses how to install HBase in single node, pseudo-distributed and fully distributed modes using Docker. It also demonstrates some common HBase shell commands like creating and listing tables, putting and getting data. Finally, it discusses how to model the user, tweet, follower and following relationships in HBase.
In these slides we introduce column-oriented stores. We deeply analyze Google BigTable: its features, data model, architecture, components, and implementation. In the second part we discuss the major open-source implementations of column-oriented databases.
This document discusses cloning Twitter using Redis by storing user, follower, and post data in Redis keys and data structures. It provides examples of how to store:
1) User profiles as Hashes with fields like username and ID.
2) Follower and following relationships as Sorted Sets with user IDs and timestamps.
3) User posts and timelines as Lists by pushing new post IDs.
It explains that while Redis lacks tables, its keys and data structures like Hashes, Sets and Lists allow building the same data model without secondary indexes. The document also notes that the system can scale horizontally by sharding the data across multiple Redis servers.
DynamoDB is a key-value database that achieves high availability and scalability through several techniques:
1. It uses consistent hashing to partition and replicate data across multiple storage nodes, allowing incremental scalability (a toy sketch of a hash ring follows this list).
2. It employs vector clocks to maintain consistency among replicas during writes, decoupling version size from update rates.
3. For handling temporary failures, it uses sloppy quorum and hinted handoff to provide high availability and durability guarantees when some replicas are unavailable.
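To make the consistent-hashing idea concrete, here is a toy Scala sketch of a hash ring that maps a key to a coordinator node plus successor replicas; it is illustrative only and not DynamoDB's actual implementation, and the node names and hash function are assumptions:
import scala.util.hashing.MurmurHash3

// Toy consistent-hash ring: node positions are hashes of node names.
class HashRing(nodes: Seq[String], replicas: Int) {
  private val ring: Vector[(Int, String)] =
    nodes.map(n => MurmurHash3.stringHash(n) -> n).sortBy(_._1).toVector

  // The coordinator is the first node clockwise from the key's hash;
  // the remaining replicas are the next distinct nodes on the ring.
  def preferenceList(key: String): Seq[String] = {
    val h = MurmurHash3.stringHash(key)
    val start = ring.indexWhere(_._1 >= h) match {
      case -1 => 0        // wrap around past the largest position
      case i  => i
    }
    (0 until ring.size).map(i => ring((start + i) % ring.size)._2).distinct.take(replicas)
  }
}

val ring = new HashRing(Seq("nodeA", "nodeB", "nodeC", "nodeD"), replicas = 3)
println(ring.preferenceList("user:42"))   // the coordinator plus two successor replicas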
Information technology has led us into an era where the production, sharing and use of information are part of everyday life, and in which we are often almost unaware actors: it is now nearly inevitable to leave a digital trail of many of the actions we perform every day, for example through digital content such as photos, videos and blog posts, and everything that revolves around social networks (Facebook and Twitter in particular). Added to this, with the "internet of things" we see an increase in devices such as watches, bracelets, thermostats and many other items that can connect to the network and therefore generate large data streams. This explosion of data justifies the birth of the term Big Data: it denotes data produced in large quantities, at remarkable speed and in different formats, whose processing requires technologies and resources that go far beyond conventional systems for managing and storing data. It is immediately clear that 1) data storage models based on the relational model, and 2) processing systems based on stored procedures and computations on grids, are not applicable in these contexts. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when facing the management of big data, variability, i.e. the lack of a fixed structure, also represents a significant problem. This has given a boost to the development of NoSQL databases. The NoSQL Databases website defines them as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are distributed, open source, horizontally scalable, without a predetermined schema (key-value, column-oriented, document-based and graph-based), easily replicable, free of ACID guarantees, and able to handle large amounts of data. They are integrated with processing tools based on the MapReduce paradigm proposed by Google. MapReduce, together with the open-source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations compared to the demands posed by new applications based on Big Data, which use NoSQL databases to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me for more information and to download the slides
1. Introduction to the Course "Designing Data Bases with Advanced Data Models..." (Fabio Fumarola)
Course Website http://pbdmng.datatoknowledge.it/
This document provides an introduction to HBase, including:
- An overview of BigTable, which HBase is modeled after
- Descriptions of the key features of HBase like being distributed, column-oriented, and versioned
- Examples of using the HBase shell to create tables, insert and retrieve data
- An explanation of the Java APIs for administering HBase and for inserting, updating, and retrieving data using Puts, Gets, and Scans (a short client sketch follows this list)
- Suggestions for setting up HBase with Docker for coding examples
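As a hedged illustration of the Put/Get pattern against the standard HBase Java client API, called here from Scala; the table name, column family, and values are assumptions, and the table is assumed to already exist:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()                    // picks up hbase-site.xml from the classpath
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("users"))

val put = new Put(Bytes.toBytes("row1"))                  // row key
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"))
table.put(put)                                            // insert/update the cell

val result = table.get(new Get(Bytes.toBytes("row1")))
println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

table.close()
connection.close()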
Docker allows building and running applications inside lightweight containers. Some key benefits of Docker include:
- Portability - Dockerized applications are completely portable and can run on any infrastructure from development machines to production servers.
- Consistency - Docker ensures that application dependencies and environments are always the same, regardless of where the application is run.
- Efficiency - Docker containers are lightweight since they don't need virtualization layers like VMs. This allows for higher density and more efficient use of resources.
This document provides information about Linux containers and Docker. It discusses:
1) The evolution of IT from client-server models to thin apps running on any infrastructure and the challenges of ensuring consistent service interactions and deployments across environments.
2) Virtual machines and their benefits of full isolation but large disk usage, and Vagrant which allows packaging and provisioning of VMs via files.
3) Docker and how it uses Linux containers powered by namespaces and cgroups to deploy applications in lightweight portable containers that are more efficient than VMs. Examples of using Docker are provided.
This document lists and describes several large network dataset collections for research purposes. It includes social networks, communication networks, citation networks, collaboration networks, web graphs, product networks, road networks, and more. Sources provided include the Stanford Large Network Dataset Collection, a Twitter dataset, leaked Facebook pages, UCIrvine Datasets, and additional results. The datasets cover a wide range of network types and can be used to study interactions in online social networks, information cascades, and networked communities.
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce (Fabio Fumarola)
The document presents MrAdam, a parallel algorithm for approximate frequent itemset mining using MapReduce. MrAdam avoids expensive communication and synchronization costs by mining approximate frequent itemsets from big data with statistical error guarantees. It combines a statistical approach based on the Chernoff bound with MapReduce-based local model discovery and global combination through an SE-tree and structural interpolation. Experiments show MrAdam is 2 to 100 times faster than previous frequent itemset mining algorithms using MapReduce.
NoSQL databases are currently used in several application scenarios, in contrast to relational databases. Several types of databases exist. In this presentation we compare key-value, column-oriented, document-oriented and graph databases. Using a simple case study, we evaluate the pros and cons of the NoSQL databases taken into account.
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
State of Artificial Intelligence Report 2023 (kuntobimo2016)
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Natural Language Processing (NLP), RAG and its applications .pptxfkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
3. Start the docker container
• Pull the image
– From https://github.com/sequenceiq/docker-spark
– Via command: docker pull sequenceiq/spark:1.3.0
• Run the Docker
– Interactive: docker run -it -P sequenceiq/spark:1.3.0 bash
Or
– Daemon: docker run -d -P sequenceiq/spark:1.3.0
3
4. Separate Container Master/Worker
Alternatively:
$ docker pull snufkin/spark-master
$ docker pull snufkin/spark-worker
•These images are based on snufkin/spark-base
$ docker run … master
$ docker run … worker
4
5. Start the spark shell
• Shell in YARN-client mode: the driver runs in a client process
and the application master is only used to request resources from YARN
– spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
• YARN-cluster mode: the Spark driver runs inside an application
master process which is managed by YARN
– spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 $SPARK_HOME/lib/spark-examples-1.3.0-hadoop2.4.0.jar
5
7. Start the shell
• Scala Spark-shell local
– spark-shell --master local[2] --driver-memory 1g --executor-memory 1g
• Python Spark-shell local
– pyspark --master local[2] --driver-memory 1g --executor-memory 1g
7
8. RDD Basics
Internally, each RDD is characterized by five main properties:
•A list of partitions
•A function for computing each split
•A list of dependencies on other RDDs
•Optionally, a Partitioner for key-value RDDs (e.g. to say that the
RDD is hash-partitioned)
•Optionally, a list of preferred locations to compute each split on
(e.g. block locations for an HDFS file)
8
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
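These properties can be inspected directly from the shell; as a minimal sketch (the small pair RDD below is only an illustration, not from the slides):
scala> val pairs = sc.parallelize(1 to 1000).map(i => (i, i.toString))
scala> pairs.partitions.length                       // the list of partitions
scala> pairs.dependencies                            // dependencies on the parent RDD
scala> pairs.partitioner                             // Option[Partitioner], None unless hash/range partitioned
scala> pairs.preferredLocations(pairs.partitions(0)) // preferred locations (empty for a parallelized collection)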
9. RDD Basics
• When a shell is started a SparkContext is created for you
• An RDD in Spark can be obtained via:
– Loading an external dataset with sc.textFile(…)
– Distributing a collection of objects with sc.parallelize(1 to 1000)
• Spark can read and distribute datasets from HDFS (hdfs://),
Cassandra, HBase, Amazon S3 (s3://), etc.
9
scala> sc
res0: org.apache.spark.SparkContext =org.apache.spark.SparkContext@5d02b84a
10. Creating an RDD from a file
If you run on YARN
•You need to interact with HDFS to list files
– hadoop fs -ls /
– hdfs dfs -ls /
•Download a file
– wget http://pbdmng.datatoknowledge.it/files/access_log
– curl -O http://pbdmng.datatoknowledge.it/files/error_log
10
11. Creating an RDD from a file
• Copy to hdfs
– hadoop fs -copyFromLocal access_log ./
• List the files
– hadoop fs -ls ./
11
bash-4.1# hadoop fs -ls ./
Found 3 items
drwxr-xr-x - root supergroup 0 2015-05-28 05:06 .sparkStaging
drwxr-xr-x - root supergroup 0 2015-01-15 04:05 input
-rw-r--r-- 1 root supergroup 5589889 2015-05-28 05:44 access_log
http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
12. Creating an RDD from a file
• Scala
val lines = sc.textFile("/user/root/access_log")
lines.count
• Python
>>> lines = sc.textFile("/user/root/error_log")
>>> lines.count()
12
13. Creating an RDD
• Scala
scala> val rdd = sc.parallelize(1 to 1000)
• Python
>>> data = [1,2,3,4,5]
>>> rdd = sc.parallelize(data)
>>> rdd.count()
13
14. RDD Example
• Create an RDD of numbers from 1 to 1000 and sum
its elements
• Scala
scala> val rdd = sc.parallelize(1 to 1000)
scala> val sum = rdd.reduce((a,b) => a + b)
• Python
>>> rdd = sc.parallelize(range(1,1001))
>>> sum = rdd.reduce(lambda a, b: a + b)
14
15. RDD and Computation
• RDDs are by default recomputed each time an action
is called
• To reuse the same RDD in multiple actions, call
– rdd.persist()
– rdd.cache()
15
16. When to Cache and when to Persist?
• With persist() and cache() an RDD’s partitions are stored in
memory buffers
– By default Spark limits this cache to 20% of the overall
reserved JVM heap
• Since the cache space is limited, it is sometimes better to
call persist() with a disk-backed storage level instead of cache()
• Cached partitions that no longer fit in memory are evicted and
must be recomputed
• Partitions persisted to disk can instead be restored from disk
16
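A minimal sketch of the difference, reusing the access_log file from the earlier slides: cache() is shorthand for persist() with the MEMORY_ONLY storage level, while an explicit disk-backed level lets evicted partitions be read back from disk instead of being recomputed.
scala> import org.apache.spark.storage.StorageLevel
scala> val logs = sc.textFile("/user/root/access_log")
scala> logs.cache()                                  // same as persist(StorageLevel.MEMORY_ONLY)
scala> val errors = logs.filter(_.contains("error"))
scala> errors.persist(StorageLevel.MEMORY_AND_DISK)  // evicted partitions spill to disk, not recomputed
scala> errors.count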
18. Passing Functions to Spark
• Spark’s API relies on passing functions defined in the driver
program to run on the cluster
• Recommended ways to define such functions
– Anonymous functions
– Methods in a singleton object
– Methods in a class that take the RDD as a parameter
18
19. Passing Functions to Spark: Scala
• Anonymous function syntax
scala> (x: Int) => x * x
res0: Int => Int = <function1>
• Singleton Object
scala> object MyFunctions {
| def func1(s: String): String = s + s
| }
scala> lines.map(MyFunctions.func1)
19
20. Passing Functions to Spark: Scala
• Class
scala> class MyClass {
| def func1(s: String): String = ???
| def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(func1)
| }
• Class with a val
scala> class MyClass {
| val field = "hello"
| def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(_ + field)
| }
20
21. Passing Functions to Spark: Python
• Function
>>> if __name__ == "__main__":
... def myFunc(s):
... words = s.split(" ")
... return len(words)
• Class
>>> class MyClass(object):
... def func(self, s):
... return s
... def doStuff(self, rdd):
... return rdd.map(self.func)
21
22. Functions and Memory Usage
• Spark reserves 20% of the allocated JVM heap to
store user functions
• When we create functions we should try to minimize
the amount of code (and captured state) they carry
• Otherwise we can run into memory issues
22
24. Transformations
• Are operations on RDDs that return a new RDD
• Transformed RDDs are computed lazily, only when an
action is called
• Two types of operations:
– Element-wise
– Partition-wise
24
25. Transformations: map
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.map(x => x * 2)
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.map(lambda a : a * 2)
25
26. Transformations: flatMap
26
Scala
scala> val list = List("hello world", "hi")
scala> val values = sc.parallelize(list)
scala> val result = values.flatMap(l => l.split(" "))
Python
>>> values = sc.parallelize(["hello world", "hi"])
>>> result = values.flatMap(lambda line: line.split(" "))
27. Transformations: filter
27
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.filter(x => x % 2 == 0)
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.filter(lambda x : x % 2 == 0)
28. Transformations: mapPartitions
28
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.mapPartitions(iter => iter.map(x => x * 2))
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.mapPartitions(lambda it: (x * 2 for x in it))
29. Transformations: mapPartitionsWithIndex
29
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.mapPartitionsWithIndex((index, iter) => iter.map(e => e * 2))
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.mapPartitionsWithIndex(lambda index, it: (e * 2 for e in it))
30. Transformations: sample
30
Scala
scala> val numbers = sc.parallelize(1 to 100)
scala> val result = numbers.sample(false,0.5D)
Python
>>> numbers = sc.parallelize(range(1,101))
>>> result = numbers.sample(False, 0.5)
31. Transformations: Union
31
Scala
scala> val list1= sc.parallelize(1 to 100)
scala> val list2= sc.parallelize(101 to 200)
scala> val result = list1.union(list2)
Python
>>> list1 = sc.parallelize(range(1,101))
>>> list2 = sc.parallelize(range(101,201))
>>> result = list1.union(list2)
32. Transformations: Intersection
32
Scala
scala> val list1= sc.parallelize(1 to 100)
scala> val list2= sc.parallelize(60 to 200)
scala> val result = list1.intersection(list2)
Python
>>> list1 = sc.parallelize(range(1,101))
>>> list2 = sc.parallelize(range(60,201))
>>> result = list1.intersection(list2)
33. Transformations: Distinct
33
Scala
scala> val list1= sc.parallelize(1 to 100)
scala> val list2= sc.parallelize(1 to 100)
scala> val result = list1.union(list2).distinct
Python
>>> list1 = sc.parallelize(range(1,101))
>>> list2 = sc.parallelize(range(1,101))
>>> result = list1.union(list2).distinct()
34. Other Transformations
• pipe(command, [envVars]) => Pipe each partition of the RDD
through a shell command, e.g. a R or bash script. RDD
elements are written to the process's stdin and lines output
to its stdout are returned as an RDD of strings.
• coalesce(numPartitions) => Decrease the number of partitions
in the RDD to numPartitions. Useful when an RDD has shrunk
after a filter operation (see the sketch below)
34
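A minimal sketch of coalesce() after a selective filter (the 404 predicate is purely illustrative):
scala> val logs = sc.textFile("/user/root/access_log")    // many partitions for a large file
scala> val notFound = logs.filter(_.contains("404"))      // few surviving records, same number of partitions
scala> val compact = notFound.coalesce(4)                 // shrink to 4 partitions without a full shuffle
scala> compact.count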
35. Other Transformations
• repartition(numPartitions) => Reshuffle the data in the RDD
randomly to create more or fewer partitions and balance it
across them. This always shuffles all data over the network.
• repartitionAndSortWithinPartitions(partitioner) =>
Repartition the RDD according to the given partitioner and,
within each resulting partition, sort records by their keys.
35
37. Actions
• Trigger the execution of the computation on the cluster
• Actions return a value to the driver program after
running a computation on the dataset
• For example:
– map is a transformation that passes each element to a
function
– reduce is an action that aggregates all the elements using a
function and returns the result to the driver program
37
38. Actions: reduce
• Aggregates the elements using a function (which must be
commutative and associative)
scala> val lines = sc.parallelize(1 to 1000)
scala> lines.reduce(_ + _)
38
39. Actions: collect
• Return all the elements of the dataset as an array at the
driver program.
• This is usually useful after a filter or other operation that
returns a sufficiently small subset of the data.
scala> val lines = sc.parallelize(1 to 1000)
scala> lines.collect
39
40. Actions: count, first, take(n)
scala> val lines = sc.parallelize(1 to 1000)
scala> lines.count
res1: Long = 1000
scala> lines.first
res2: Int = 1
scala> lines.take(5)
res4: Array[Int] = Array(1, 2, 3, 4, 5)
40
44. Motivation
• Pair RDDs are useful for operations that allow you to
work on each key in parallel
• Key/value RDDs are commonly used to perform
aggregations
• Often we will do some initial ETL to get our data into
key/value format
44
45. Why key/value pairs
• Let us consider an example
scala> val lines = sc.parallelize(1 to 1000)
scala> val fakePairs = lines.map(v => (v.toString, v))
• The type of fakePairs is RDD[(String, Int)] and exposes only the basic RDD
functions
• But, Spark provides PairRDDFunctions with methods on
key/value pairs
scala> import org.apache.spark.rdd.RDD._
scala> val pairs = rddToPairRDDFunctions(lines.map(i => i -> i.toString))
//<- from spark 1.3.0
45
46. Transformations for key/value
• groupByKey([numTasks]) => Called on a dataset of (K, V)
pairs, returns a dataset of (K, Iterable<V>) pairs.
• reduceByKey(func, [numTasks]) => Called on a dataset of (K,
V) pairs, returns a dataset of (K, V) pairs where the values for
each key are aggregated using the given reduce function func,
which must be of type (V,V) => V.
46
47. Transformations for key/value
• sortByKey([ascending], [numTasks]) => Called on a dataset of
(K, V) pairs where K implements Ordered, returns a dataset of
(K, V) pairs sorted by keys in ascending or descending order,
as specified in the boolean ascending argument.
• join(otherDataset, [numTasks]) => Called on datasets of type
(K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all
pairs of elements for each key. Outer joins are supported
through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
47
48. Transformations for key/value
• cogroup(otherDataset, [numTasks]) => Called on datasets of
type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>,
Iterable<W>)) tuples. This operation is also called groupWith.
• cartesian(otherDataset) => Called on datasets of types T
and U, returns a dataset of (T, U) pairs (all pairs of elements).
48
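A minimal sketch of join and cogroup on two small pair RDDs (the data below is purely illustrative):
scala> val ages  = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
scala> val towns = sc.parallelize(Seq(("alice", "Bari"), ("carol", "Milan")))
scala> ages.join(towns).collect      // RDD[(String, (Int, String))]: only keys present in both ("alice")
scala> ages.cogroup(towns).collect   // RDD[(String, (Iterable[Int], Iterable[String]))]: keys from either RDD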
49. Aggregations with PairRDD
• When we have key/value pairs it is common to want to
aggregate statistics across all elements with the same
key
• Examples are:
– Per key average
– Word count
49
50. Per Key Average
We use mapValues() together with reduceByKey() to compute a
per-key average
>>> rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
scala> rdd.mapValues((_, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
50
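The result above holds (sum, count) pairs per key; as a follow-up sketch not on the original slide (assuming the values in rdd are numeric, e.g. Int), a final mapValues() turns them into the actual averages:
scala> val sums = rdd.mapValues((_, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
scala> val averages = sums.mapValues { case (sum, count) => sum.toDouble / count }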
51. Word Count
We can use the reduceByKey() function
>>> result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
scala> val result = words.map((_, 1)).reduceByKey(_ + _)
51
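Putting it together end to end, as a minimal sketch that reuses the access_log file loaded earlier (any text file works):
scala> val words = sc.textFile("/user/root/access_log").flatMap(_.split(" "))
scala> val counts = words.map((_, 1)).reduceByKey(_ + _)
scala> counts.take(10).foreach(println)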
52. PairRDD Best Practices
• Most of these operations involve shuffling data over the network
• Since Spark 1.0, PairRDD functions such as cogroup(),
join(), leftOuterJoin(), rightOuterJoin(), groupByKey(),
reduceByKey() and lookup() benefit from data
partitioning.
• For example, with reduceByKey() the reduce function is
applied locally on each partition and only the final result
is sent over the network
52
53. PairRDD Best Practices
• In general it is better to prefer reduceByKey() over
groupByKey() (see the sketch below)
53
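A minimal sketch of why: both lines below compute the same word counts, but reduceByKey() combines values locally on each partition before the shuffle, while groupByKey() ships every (word, 1) pair across the network first:
scala> val pairs = sc.textFile("/user/root/access_log").flatMap(_.split(" ")).map((_, 1))
scala> val fast = pairs.reduceByKey(_ + _)             // map-side combine, small shuffle
scala> val slow = pairs.groupByKey().mapValues(_.sum)  // all pairs shuffled, then summed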
56. Shared Variables
• Each function passed to a Spark operation is
executed on a remote cluster node
• The variables used inside the function are copied to each machine
• No updates to the variables on the remote
machines are propagated back to the driver program
• To share state across the cluster Spark supports: Broadcast
Variables and Accumulators
56
57. Broadcast Variables
• Allow the programmer to keep a read-only variable
cached on each machine.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> broadcastVar.value
>>> broadcastVar = sc.broadcast([1, 2, 3])
>>> broadcastVar.value
57
58. Accumulators
• They can be used to implement counters (as in
MapReduce) or sums.
scala> val accum = sc.accumulator(0, "My Accumulator")
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
>>> accum = sc.accumulator(0)
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
>>> accum.value
58
60. Spark SQL
Provides 3 main capabilities:
1.Load data from different sources (JSON, Hive and
Parquet)
2.Query the data using SQL
3.Integration between SQL and the regular
Python/Java/Scala APIs
These APIs are changing due to the new DataFrames API
60
61. Initializing Spark SQL
The entry point is a SQLContext; to create a basic one, all you
need is a SparkContext.
•If we have access to a Hive installation
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)
•otherwise
scala> import org.apache.spark.sql.SQLContext
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
61
62. Basic Query Example
• To make a query on a table we call the sql() method
on the Hive or SQL context
scala> val table = hiveContext.jsonFile("file.json")
scala> table.registerTempTable("tweets")
scala> val topTweets = hiveContext.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")
Download file from: https://raw.githubusercontent.com/databricks/learning-spark/master/files/testweet.json
62
63. Schema RDD
• Both loading data and executing queries return a
SchemaRDD.
• A SchemaRDD is an RDD composed of:
– Row objects with
– Information about schema and columns
• Row objects are wrappers around arrays of basic
types (integer, string, double,…)
63
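As a minimal sketch (reusing the topTweets result from the query example above and assuming column 0 is the tweet text), individual values are read from a Row with typed getters:
scala> val texts = topTweets.map(row => row.getString(0))  // column 0 assumed to be the text column
scala> texts.take(5).foreach(println)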
64. Data Types
• All Spark SQL data types are available after importing
scala> import org.apache.spark.sql.types._
64
http://spark.apache.org/docs/latest/sql-programming-guide.html#data_types
65. Loading and Saving Data
• Spark SQL supports different structured data sources
out of the box:
– Hive tables,
– JSON,
– Parquet files, and
– JDBC and NoSQL databases
– Regular RDDs converted to SchemaRDDs
65
66. Apache Hive
• In this scenario, Spark SQL supports any Hive-
supported storage format:
– Text files, RCFiles, Parquet, Avro, Protocol Buffers
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)
scala> val rows = hiveContext.sql("SELECT key, value FROM mytable")
scala> val keys = rows.map(row => row.getInt(0))
66
68. Parquet.io
• Column-Oriented storage format that can store records with
nested fields efficiently.
• Spark SQL supports reading from and writing to this format
scala> val people: RDD[Person] = ...
scala> people.saveAsParquetFile("people.parquet")
scala> val parquetFile = sqlContext.parquetFile("people.parquet")
scala> parquetFile.registerTempTable("parquetFile")
scala> val teenagers = sqlContext.sql("SELECT name FROM parquetFile
WHERE age >= 13 AND age <= 19")
68
69. JSON
• Spark SQL loads JSON from:
– jsonFile: loads data from a directory of JSON files
– jsonRDD: loads data from an RDD of JSON strings (one object per string)
scala> val path = "examples/src/main/resources/people.json"
scala> val people = sqlContext.jsonFile(path)
scala> people.printSchema()
scala> val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
scala> val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
69
70. Partition Discovery
• Table partitioning is a common optimization approach used in
systems like Hive
• Data are usually stored in different directories, with
partitioning column values encoded in the path of each
partition directory
70
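A minimal sketch of such a layout, with gender and country as hypothetical partitioning columns; the column names and values are recovered from the directory names:
path/to/table/gender=male/country=US/data.parquet
path/to/table/gender=male/country=CN/data.parquet
path/to/table/gender=female/country=US/data.parquet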
71. DataFrames API
• It is a distributed collection of data organized into named
columns
• equivalent to a table in a relational database or a data frame
in R/Python
scala> val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
scala> df.show()
scala> df.select(df("name"), df("age") + 1).show()
scala> df.filter(df("age") > 21).show()
• The API is not yet stable (more examples in the sketch below)
71
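A couple of further operations on the same DataFrame, as a minimal sketch:
scala> df.printSchema()                  // inferred schema of the JSON file
scala> df.groupBy("age").count().show()  // number of people per age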
73. Overview
• Extension of the core API for processing live data streams
• Data can be ingested from: Kafka, Flume, Twitter, ZeroMQ,
Kinesis or TCP sockets
• And can be processed using complex algorithms expressed
with high-level functions like map, reduce, join and window.
73
74. How it works internally
• It receives live input data streams and divides the
data into batches
• These batches are processed by the Spark engine to
generate the final stream of results in batches.
74
75. Example: Word Count
• Create the streaming context
scala> import org.apache.spark._
scala> import org.apache.spark.streaming._
scala> val ssc = new StreamingContext(sc, Seconds(5))
• Create a DStream
scala> val lines = ssc.socketTextStream("localhost", 9999)
scala> val words = lines.flatMap(_.split(" "))
75
76. Example: Word Count
• Perform the streaming word count
scala> val words = lines.flatMap(_.split(" "))
scala> val pairs = words.map(word => (word, 1))
scala> val wordCounts = pairs.reduceByKey(_ + _)
scala> wordCounts.print()
• Start the streaming processing
scala> ssc.start()
scala> ssc.awaitTermination()
76
77. Example Word Count
• Start a shell and Install netcat
– docker exec -it <docker name> bash
– yum install nc.x86_64
• Start a netcat on port 9999
– nc -lk 9999
• Write some words
77
78. Discretized Streams (DStreams)
• It represents a continuous stream of data
• Internally, a DStream is represented by a continuous series of
RDDs
• Each RDD in a DStream contains data from a certain interval
78
79. Operation on DStreams
• Any operation applied on a DStream translates to operations
on the underlying RDDs
79
80. Streaming Sources
• Apart from the example above, we can create streams from:
– Basic sources: files (HDFS, S3, NFS), Akka Actors
and queues of RDDs as a stream (for testing)
– Advanced sources: external systems like Kafka, Flume,
Twitter, ZeroMQ, Kinesis
• Advanced sources are used via external libraries
80
81. Advanced Source: Twitter
• Linking: Add the artifact spark-streaming-twitter_2.10 to the
SBT/Maven project dependencies.
• Programming: create a DStream with
TwitterUtils.createStream
scala> import org.apache.spark.streaming.twitter._
scala> TwitterUtils.createStream(ssc, None)
81
84. Sliding Window Operations
• Spark Streaming also provides windowed computations
– window length - The duration of the window (3 in the figure)
– sliding interval - The interval at which the window operation is
performed (2 in the figure).
scala> val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))
84
87. Interactive Analysis
scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30
Load the data
scala> val pagecounts = sc.textFile("data/pagecounts")
INFO mapred.FileInputFormat: Total input paths to process : 74
pagecounts: spark.RDD[String] = MappedRDD[1] at textFile at <console>:12
87
88. Interactive Analysis
• Get the first 10 records
scala> pagecounts.take(10)
• Print the element
scala> pagecounts.take(10).foreach(println)
20090505-000000 aa.b ?71G4Bo1cAdWyg 1 14463
20090505-000000 aa.b Special:Statistics 1 840
20090505-000000 aa.b Special:Whatlinkshere/MediaWiki:Returnto 1 1019
88
90. Interactive Analysis
• To avoid reloading the RDD for each operation we can
cache it in memory
scala> val enPages = pagecounts.filter(_.split(" ")(1) == "en").cache
• Next time we call an operation on enPages it will be executed
from cache
scala> enPages.count
90
91. Interactive Analysis
• Let us generate a histogram of total page views on Wikipedia
pages for the date range in our dataset
scala> val enTuples = enPages.map(line => line.split(" "))
scala> val enKeyValuePairs = enTuples.map(line => (line(0).substring(0, 8), line(3).toInt))
scala> enKeyValuePairs.reduceByKey(_+_, 1).collect
91
92. Other Exercise series
• Spark SQL: Use the Spark shell to write interactive
SQL queries
• Tachyon: Deploy Tachyon and try its basic
functionality.
• MLlib: Build a movie recommender with Spark
• GraphX: Explore graph-structured data and graph
algorithms
92
http://ampcamp.berkeley.edu/5/
http://ampcamp.berkeley.edu/big-data-mini-course-home/
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.