SlideShare a Scribd company logo
1 of 13
Download to read offline
Spark Internals
Training Contents
page2 ☛ Course Contents
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
Spark Internals
Course Contents
☛
☛
☛
☛
☛
☛
☛
☛
☛
☛
☛
☛
☛
☛
☛
Distributing Data with HDFS Day1
Understanding Hadoop I/O
Spark Introduction
RDDs
RDD Internals:Part-1
RDD Internals:Part-2 Day2
Data ingress and egress
Running on a Cluster
Spark Internals
Advanced Spark Programming
Spark Streaming
Spark SQL Day3
Tuning and Debugging
Spark Kafka Internals
Storm Internals
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
Spark Internals
■ Distributing Data with HDFS
• Interfaces
• Hadoop Filesystems
• The Design of HDFS
■ Parallel Copying with distcp
• Keeping an HDFS Cluster Balanced
• Hadoop Archives
■
■
Using Hadoop Archives
• Limitations
Data Flow
• Anatomy of a File Write
• Anatomy of a File Read
• Coherency Model
■
■
The Command-Line Interface
• Basic Filesystem Operations
The Java Interface
• Querying the Filesystem
• Reading Data Using the FileSystem API
• Directories
• Deleting Data
• Reading Data from a Hadoop URL
• Writing Data
■ Understanding Hadoop I/O
■ Serialization
• Implementing a Custom Writable
• Serialization Frameworks
• The Writable Interface
• Writable Classes
• Avro
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
■ Data Integrity
• ChecksumFileSystem
• LocalFileSystem
• Data Integrity in HDFS
■ ORC Files
• Large size enables efficient read of columns
• New types (datetime, decimal)
• Encoding specific to the column type
• Default stripe size is 250 MB
• A single file as output of each task
• Split files without scanning for markers
• Bound the amount of memory required for reading or writing.
• Lowers pressure on the NameNode
• Dramatically simplifies integration with Hive
• Break file into sets of rows called a stripe
• Complex types (struct, list, map, union)
• Support for the Hive type model
■ ORC File:Footer
• Count, min, max, and sum for each column
• Types, number of rows
• Contains list of stripes
■ ORC Files:Index
• Required for skipping rows
• Position in each stream
• Min and max for each column
• Currently every 10,000 rows
• Could include bit field or bloom filter
■ ORC Files:Postscript
• Contains compression parameters
• Size of compressed footer
■ ORC Files:Data
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
■
■
■
• Directory of stream locations
• Required for table scan
Parquet
• Nested Encoding
• Configurations
• Error recovery
• Extensibility
• Nulls
• File format
• Data Pages
• Motivation
• Unit of parallelization
• Logical Types
• Metadata
• Modules
• Column chunks
• Separating metadata and column data
• Checksumming
• Types
File-Based Data Structures
• MapFile
• SequenceFile
Compression
• Codecs
• Using Compression in MapReduce
• Compression and Input Splits
■ Spark Introduction
• GraphX
• MLlib
• Spark SQL
• Data Processing Applications
• Spark Streaming
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
■
■
• What Is Apache Spark?
• Data Science Tasks
• Spark Core
• Storage Layers for Spark
• Who Uses Spark, and for What?
• A Unified Stack
• Cluster Managers
RDDs
• Lazy Evaluation
• Common Transformations and Actions
• Passing Functions to Spark
• Creating RDDs
• RDD Operations
• Actions
• Transformations
• Scala
• Java
• Persistence
• Python
• Converting Between RDD Types
• RDD Basics
• Basic RDDs
RDD Internals:Part-1
• Expressing Existing Programming Models
• Fault Recovery
• Interpreter Integration
• Memory Management
• Implementation
• MapReduce
• RDD Operations in Spark
• Iterative MapReduce
• Console Log Minning
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
■
■
• Google's Pregel
• User Applications Built with Spark
• Behavior with Insufficient Memory
• Support for Checkpointing
• A Fault-Tolerant Abstraction
• Evaluation
• Job Scheduling
• Spark Programming Interface
• Advantages of the RDD Model
• Understanding the Speedup
• Leveraging RDDs for Debugging
• Iterative Machine Learning Applications
• Explaining the Expressivity of RDDs
• Representing RDDs
• Applications Not Suitable for RDDs
RDD Internals:Part-2
• Sorting Data
• Operations That Affect Partitioning
• Determining an RDD’s Partitioner
• Grouping Data
• Motivation
• Aggregations
• Data Partitioning (Advanced)
• Actions Available on Pair RDDs
• Joins
• Creating Pair RDDs
• Operations That Benefit from Partitioning
• Transformations on Pair RDDs
• Example: PageRank
• Custom Partitioners
Data ingress and egress
• Hadoop Input and Output Formats
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
• File Formats
• Local/“Regular” FS
• Text Files
• Java Database Connectivity
• Structured Data with Spark SQL
• Elasticsearch
• File Compression
• Apache Hive
• Cassandra
• Object Files
• Comma-Separated Values and Tab-Separated Values
• HBase
• Databases
• Filesystems
• SequenceFiles
• JSON
• HDFS
• Motivation
• JSON
• Amazon S3
■ Running on a Cluster
• Scheduling Within and Between Spark Applications
• Spark Runtime Architecture
• A Scala Spark Application Built with sbt
• Packaging Your Code and Dependencies
• Launching a Program
• A Java Spark Application Built with Maven
• Hadoop YARN
• Deploying Applications with spark-submit
• The Driver
• Standalone Cluster Manager
• Cluster Managers
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
• Executors
• Amazon EC2
• Cluster Manager
• Dependency Conflicts
• Apache Mesos
• Which Cluster Manager to Use?
■ Spark Internals
■ Spark:YARN Mode
• Resource Manager
• Node Manager
• Workers
• Containers
• Threads
• Task
• Executers
• Application Master
• Multiple Applications
• Tuning Parameters
■ Spark:LocalMode
■ Spark Caching
• With Serialization
• Off-heap
• In Memory
■ Running on a Cluster
• Scheduling Within and Between Spark Applications
• Spark Runtime Architecture
• A Scala Spark Application Built with sbt
• Packaging Your Code and Dependencies
• Launching a Program
• A Java Spark Application Built with Maven
• Hadoop YARN
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
• Deploying Applications with spark-submit
• The Driver
• Standalone Cluster Manager
• Cluster Managers
• Executors
• Amazon EC2
• Cluster Manager
• Dependency Conflicts
• Apache Mesos
• Which Cluster Manager to Use?
■
■
Spark Serialization
StandAlone Mode
• Task
• Multiple Applications
• Executers
• Tuning Parameters
• Workers
• Threads
■ Advanced Spark Programming
• Working on a Per-Partition Basis
• Optimizing Broadcasts
• Accumulators
• Custom Accumulators
• Accumulators and Fault Tolerance
• Numeric RDD Operations
• Piping to External Programs
• Broadcast Variables
■ Spark Streaming
• Stateless Transformations
• Output Operations
• Checkpointing
• Core Sources
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
■
■
• Receiver Fault Tolerance
• Worker Fault Tolerance
• Stateful Transformations
• Batch and Window Sizes
• Architecture and Abstraction
• Performance Considerations
• Streaming UI
• Driver Fault Tolerance
• Multiple Sources and Cluster Sizing
• Processing Guarantees
• A Simple Example
• Input Sources
• Additional Sources
• Transformations
Spark SQL
• User-Defined Functions
• Long-Lived Tables and Queries
• Spark SQL Performance
• Apache Hive
• Loading and Saving Data
• Initializing Spark SQL
• Parquet
• Performance Tuning Options
• SchemaRDDs
• Caching
• JSON
• From RDDs
• Linking with Spark SQL
• Spark SQL UDFs
• Using Spark SQL in Applications
• Basic Query Example
Tuning and Debugging Spark
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
• Driver and Executor Logs
• Memory Management
• Finding Information
• Configuring Spark with SparkConf
• Key Performance Considerations
• Components of Execution: Jobs, Tasks, and Stages
• Spark Web UI
• Hardware Provisioning
• Level of Parallelism
• Serialization Format
■ Kafka Internals
■
■
■
Kafka Core Concepts
• brokers
• Topics
• producers
• replicas
• Partitions
• consumers
Operating Kafka
• P&S tuning
• monitoring
• deploying
• Architecture
• hardware specs
Developing Kafka apps
• serialization
• compression
• testing
• Case Study
• reading from Kafka
• Writing to Kafka
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com
■ Storm Internals
■ Developing Storm apps
• Case Studies
• Bolts and topologies
• P&S tuning
• serialization
• testing
• Kafka integration
■
■
Storm core concepts
• spouts
• groupings
• tuples
• bolts
• parallelism
• Topologies
Operating Storm
• monitoring
• Architecture
• deploying
• hardware specs
Mobile: +91 7719882295/ 9730463630
Email: sales@anikatechnologies.com
Website:www.anikatechnologies.com

More Related Content

What's hot

Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!Databricks
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark Summit
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Databricks
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusDatabricks
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Databricks
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingDatabricks
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouMetrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouDatabricks
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Databricks
 

What's hot (20)

Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
Interoperating a Zoo of Data Processing Platforms Using with Rheem Sebastian ...
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouMetrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
 

Similar to Spark Internals Training | Apache Spark | Spark | Anika Technologies

The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Flink Forward
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsPat Patterson
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...MongoDB
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practiceDarko Marjanovic
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformDataStax Academy
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Anthony Baker
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & pythonMaloy Manna, PMP®
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015Iulia Emanuela Iancuta
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 

Similar to Spark Internals Training | Apache Spark | Spark | Anika Technologies (20)

The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 

More from Anand Narayanan

Scrum Foundation Training by Anika Technologies
Scrum Foundation Training by Anika TechnologiesScrum Foundation Training by Anika Technologies
Scrum Foundation Training by Anika TechnologiesAnand Narayanan
 
Agile Essentials Training by Anika Technologies
Agile Essentials Training by Anika TechnologiesAgile Essentials Training by Anika Technologies
Agile Essentials Training by Anika TechnologiesAnand Narayanan
 
Smart Staffing using Regression Analysis Model
Smart Staffing using Regression Analysis ModelSmart Staffing using Regression Analysis Model
Smart Staffing using Regression Analysis ModelAnand Narayanan
 
Sentiment analysis using nlp
Sentiment analysis using nlpSentiment analysis using nlp
Sentiment analysis using nlpAnand Narayanan
 
Deep learning with_computer_vision
Deep learning with_computer_visionDeep learning with_computer_vision
Deep learning with_computer_visionAnand Narayanan
 
Advanced Elastic Search | Elastic Search | Kibana | Logstash
Advanced Elastic Search | Elastic Search | Kibana | LogstashAdvanced Elastic Search | Elastic Search | Kibana | Logstash
Advanced Elastic Search | Elastic Search | Kibana | LogstashAnand Narayanan
 
JVM and Java Performance Tuning | JVM Tuning | Java Performance
JVM and Java Performance Tuning | JVM Tuning | Java PerformanceJVM and Java Performance Tuning | JVM Tuning | Java Performance
JVM and Java Performance Tuning | JVM Tuning | Java PerformanceAnand Narayanan
 
Java Concurrency and Performance | Multi Threading | Concurrency | Java Conc...
Java Concurrency and Performance | Multi Threading  | Concurrency | Java Conc...Java Concurrency and Performance | Multi Threading  | Concurrency | Java Conc...
Java Concurrency and Performance | Multi Threading | Concurrency | Java Conc...Anand Narayanan
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Anand Narayanan
 
Big Data Analytics and Artifical Intelligence
Big Data Analytics and Artifical IntelligenceBig Data Analytics and Artifical Intelligence
Big Data Analytics and Artifical IntelligenceAnand Narayanan
 
SynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDFSynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDFAnand Narayanan
 

More from Anand Narayanan (12)

Scrum Foundation Training by Anika Technologies
Scrum Foundation Training by Anika TechnologiesScrum Foundation Training by Anika Technologies
Scrum Foundation Training by Anika Technologies
 
Agile Essentials Training by Anika Technologies
Agile Essentials Training by Anika TechnologiesAgile Essentials Training by Anika Technologies
Agile Essentials Training by Anika Technologies
 
Smart Staffing using Regression Analysis Model
Smart Staffing using Regression Analysis ModelSmart Staffing using Regression Analysis Model
Smart Staffing using Regression Analysis Model
 
Sentiment analysis using nlp
Sentiment analysis using nlpSentiment analysis using nlp
Sentiment analysis using nlp
 
Deep learning with_computer_vision
Deep learning with_computer_visionDeep learning with_computer_vision
Deep learning with_computer_vision
 
Advanced Elastic Search | Elastic Search | Kibana | Logstash
Advanced Elastic Search | Elastic Search | Kibana | LogstashAdvanced Elastic Search | Elastic Search | Kibana | Logstash
Advanced Elastic Search | Elastic Search | Kibana | Logstash
 
Deep learning internals
Deep learning internalsDeep learning internals
Deep learning internals
 
JVM and Java Performance Tuning | JVM Tuning | Java Performance
JVM and Java Performance Tuning | JVM Tuning | Java PerformanceJVM and Java Performance Tuning | JVM Tuning | Java Performance
JVM and Java Performance Tuning | JVM Tuning | Java Performance
 
Java Concurrency and Performance | Multi Threading | Concurrency | Java Conc...
Java Concurrency and Performance | Multi Threading  | Concurrency | Java Conc...Java Concurrency and Performance | Multi Threading  | Concurrency | Java Conc...
Java Concurrency and Performance | Multi Threading | Concurrency | Java Conc...
 
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
Understanding and Designing Ultra low latency systems | Low Latency | Ultra L...
 
Big Data Analytics and Artifical Intelligence
Big Data Analytics and Artifical IntelligenceBig Data Analytics and Artifical Intelligence
Big Data Analytics and Artifical Intelligence
 
SynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDFSynopsisLowLatencySeminar.PDF
SynopsisLowLatencySeminar.PDF
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Spark Internals Training | Apache Spark | Spark | Anika Technologies

  • 1. Spark Internals Training Contents page2 ☛ Course Contents Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 2. Spark Internals Course Contents ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ ☛ Distributing Data with HDFS Day1 Understanding Hadoop I/O Spark Introduction RDDs RDD Internals:Part-1 RDD Internals:Part-2 Day2 Data ingress and egress Running on a Cluster Spark Internals Advanced Spark Programming Spark Streaming Spark SQL Day3 Tuning and Debugging Spark Kafka Internals Storm Internals Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 3. Spark Internals ■ Distributing Data with HDFS • Interfaces • Hadoop Filesystems • The Design of HDFS ■ Parallel Copying with distcp • Keeping an HDFS Cluster Balanced • Hadoop Archives ■ ■ Using Hadoop Archives • Limitations Data Flow • Anatomy of a File Write • Anatomy of a File Read • Coherency Model ■ ■ The Command-Line Interface • Basic Filesystem Operations The Java Interface • Querying the Filesystem • Reading Data Using the FileSystem API • Directories • Deleting Data • Reading Data from a Hadoop URL • Writing Data ■ Understanding Hadoop I/O ■ Serialization • Implementing a Custom Writable • Serialization Frameworks • The Writable Interface • Writable Classes • Avro Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 4. ■ Data Integrity • ChecksumFileSystem • LocalFileSystem • Data Integrity in HDFS ■ ORC Files • Large size enables efficient read of columns • New types (datetime, decimal) • Encoding specific to the column type • Default stripe size is 250 MB • A single file as output of each task • Split files without scanning for markers • Bound the amount of memory required for reading or writing. • Lowers pressure on the NameNode • Dramatically simplifies integration with Hive • Break file into sets of rows called a stripe • Complex types (struct, list, map, union) • Support for the Hive type model ■ ORC File:Footer • Count, min, max, and sum for each column • Types, number of rows • Contains list of stripes ■ ORC Files:Index • Required for skipping rows • Position in each stream • Min and max for each column • Currently every 10,000 rows • Could include bit field or bloom filter ■ ORC Files:Postscript • Contains compression parameters • Size of compressed footer ■ ORC Files:Data Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 5. ■ ■ ■ • Directory of stream locations • Required for table scan Parquet • Nested Encoding • Configurations • Error recovery • Extensibility • Nulls • File format • Data Pages • Motivation • Unit of parallelization • Logical Types • Metadata • Modules • Column chunks • Separating metadata and column data • Checksumming • Types File-Based Data Structures • MapFile • SequenceFile Compression • Codecs • Using Compression in MapReduce • Compression and Input Splits ■ Spark Introduction • GraphX • MLlib • Spark SQL • Data Processing Applications • Spark Streaming Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 6. ■ ■ • What Is Apache Spark? • Data Science Tasks • Spark Core • Storage Layers for Spark • Who Uses Spark, and for What? • A Unified Stack • Cluster Managers RDDs • Lazy Evaluation • Common Transformations and Actions • Passing Functions to Spark • Creating RDDs • RDD Operations • Actions • Transformations • Scala • Java • Persistence • Python • Converting Between RDD Types • RDD Basics • Basic RDDs RDD Internals:Part-1 • Expressing Existing Programming Models • Fault Recovery • Interpreter Integration • Memory Management • Implementation • MapReduce • RDD Operations in Spark • Iterative MapReduce • Console Log Minning Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 7. ■ ■ • Google's Pregel • User Applications Built with Spark • Behavior with Insufficient Memory • Support for Checkpointing • A Fault-Tolerant Abstraction • Evaluation • Job Scheduling • Spark Programming Interface • Advantages of the RDD Model • Understanding the Speedup • Leveraging RDDs for Debugging • Iterative Machine Learning Applications • Explaining the Expressivity of RDDs • Representing RDDs • Applications Not Suitable for RDDs RDD Internals:Part-2 • Sorting Data • Operations That Affect Partitioning • Determining an RDD’s Partitioner • Grouping Data • Motivation • Aggregations • Data Partitioning (Advanced) • Actions Available on Pair RDDs • Joins • Creating Pair RDDs • Operations That Benefit from Partitioning • Transformations on Pair RDDs • Example: PageRank • Custom Partitioners Data ingress and egress • Hadoop Input and Output Formats Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 8. • File Formats • Local/“Regular” FS • Text Files • Java Database Connectivity • Structured Data with Spark SQL • Elasticsearch • File Compression • Apache Hive • Cassandra • Object Files • Comma-Separated Values and Tab-Separated Values • HBase • Databases • Filesystems • SequenceFiles • JSON • HDFS • Motivation • JSON • Amazon S3 ■ Running on a Cluster • Scheduling Within and Between Spark Applications • Spark Runtime Architecture • A Scala Spark Application Built with sbt • Packaging Your Code and Dependencies • Launching a Program • A Java Spark Application Built with Maven • Hadoop YARN • Deploying Applications with spark-submit • The Driver • Standalone Cluster Manager • Cluster Managers Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 9. • Executors • Amazon EC2 • Cluster Manager • Dependency Conflicts • Apache Mesos • Which Cluster Manager to Use? ■ Spark Internals ■ Spark:YARN Mode • Resource Manager • Node Manager • Workers • Containers • Threads • Task • Executers • Application Master • Multiple Applications • Tuning Parameters ■ Spark:LocalMode ■ Spark Caching • With Serialization • Off-heap • In Memory ■ Running on a Cluster • Scheduling Within and Between Spark Applications • Spark Runtime Architecture • A Scala Spark Application Built with sbt • Packaging Your Code and Dependencies • Launching a Program • A Java Spark Application Built with Maven • Hadoop YARN Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 10. • Deploying Applications with spark-submit • The Driver • Standalone Cluster Manager • Cluster Managers • Executors • Amazon EC2 • Cluster Manager • Dependency Conflicts • Apache Mesos • Which Cluster Manager to Use? ■ ■ Spark Serialization StandAlone Mode • Task • Multiple Applications • Executers • Tuning Parameters • Workers • Threads ■ Advanced Spark Programming • Working on a Per-Partition Basis • Optimizing Broadcasts • Accumulators • Custom Accumulators • Accumulators and Fault Tolerance • Numeric RDD Operations • Piping to External Programs • Broadcast Variables ■ Spark Streaming • Stateless Transformations • Output Operations • Checkpointing • Core Sources Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 11. ■ ■ • Receiver Fault Tolerance • Worker Fault Tolerance • Stateful Transformations • Batch and Window Sizes • Architecture and Abstraction • Performance Considerations • Streaming UI • Driver Fault Tolerance • Multiple Sources and Cluster Sizing • Processing Guarantees • A Simple Example • Input Sources • Additional Sources • Transformations Spark SQL • User-Defined Functions • Long-Lived Tables and Queries • Spark SQL Performance • Apache Hive • Loading and Saving Data • Initializing Spark SQL • Parquet • Performance Tuning Options • SchemaRDDs • Caching • JSON • From RDDs • Linking with Spark SQL • Spark SQL UDFs • Using Spark SQL in Applications • Basic Query Example Tuning and Debugging Spark Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 12. • Driver and Executor Logs • Memory Management • Finding Information • Configuring Spark with SparkConf • Key Performance Considerations • Components of Execution: Jobs, Tasks, and Stages • Spark Web UI • Hardware Provisioning • Level of Parallelism • Serialization Format ■ Kafka Internals ■ ■ ■ Kafka Core Concepts • brokers • Topics • producers • replicas • Partitions • consumers Operating Kafka • P&S tuning • monitoring • deploying • Architecture • hardware specs Developing Kafka apps • serialization • compression • testing • Case Study • reading from Kafka • Writing to Kafka Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com
  • 13. ■ Storm Internals ■ Developing Storm apps • Case Studies • Bolts and topologies • P&S tuning • serialization • testing • Kafka integration ■ ■ Storm core concepts • spouts • groupings • tuples • bolts • parallelism • Topologies Operating Storm • monitoring • Architecture • deploying • hardware specs Mobile: +91 7719882295/ 9730463630 Email: sales@anikatechnologies.com Website:www.anikatechnologies.com