All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
Why do we need Database Awareness?
Document vs Relational
Row-based vs Column-based
In-memory Database vs In-memory Data grids
Graph
Time-series
Solr vs ElasticSearch
Event Store
Debunking "Purpose-Built Data Systems": Enter the Universal Database - Stavros Papadopoulos
Purpose-built databases and platforms have actually created more complexity, effort, and unnecessary reinvention. The status quo is a big mess. TileDB took the opposite approach.
In this presentation, Stavros, the original creator of TileDB, shared the underlying principles of the TileDB universal database built on multi-dimensional arrays, making the case for it as a true first in the data management industry.
An overview of several technologies which contribute to the landscape of Big Data.
An intro to the technology challenges of Big Data, followed by key open-source components which help in dealing with various Big Data aspects such as OLAP, Real-Time Online Analytics, and Machine Learning on Map-Reduce. I conclude with an enumeration of the key areas where those technologies are most likely to unleash new opportunities for various businesses.
Slides from the August 2021 St. Louis Big Data IDEA meeting from Sam Portillo. The presentation covers AWS EMR including comparisons to other similar projects and lessons learned. A recording is available in the comments for the meeting.
Here I talk about examples and use cases for Big Data & Big Data Analytics and how we accomplished massive-scale sentiment, campaign and marketing analytics for Razorfish using a collection of database, Big Data and analytics technologies.
Definitive Guide to Select Right Data Warehouse (2020) - Sprinkle Data Inc
Choosing the right data warehouse is a big challenge for organisations. In this doc, we have made an end to end comparison of leading data warehouses. Snowflake vs Redshift vs BigQuery vs Hive vs Athena
Sprinkledata.com
In this webinar Thomas Cook, Sales Director, AnzoGraph DB, provides a history lesson on the origins of SPARQL, including its roots in the Semantic Web, and how linked open data is used to create Knowledge Graphs. Then, he dives into "What is RDF?", "What is a URI?" and "What is SPARQL?", wrapping up with a real-world demonstration via a Zeppelin notebook.
Big Data in the Cloud with Azure Marketplace Images - Mark Kromer
Here are some of the trends that I'm seeing from customers looking to build Azure-based cloud Big Data solutions using images from the Azure Marketplace.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data... - Avinash Ramineni
Enterprises have been rapidly adopting data lakes as a complement to or replacement for data warehouses. Many data lake implementations ignore the inherent drawbacks and limitations of data lakes and end up as data swamps with little or no benefit to the business. In this session we will go through some of the challenges and the key aspects that need to be considered for successful data lake implementations.
This is a presentation by Peter Coppola, VP of Product and Marketing at Basho Technologies and Matthew Aslett, Research Director at 451 Research. Join them as they discuss whether multi-model databases and polyglot persistence have increased operational complexity. They'll discuss the benefits and importance of NoSQL databases and how the Basho Data Platform helps enterprises leverage Big Data applications.
I presented AWS Big Data Analytics technologies and discussed how AWS provides a big data platform that allows you to collect, store, and analyze data, and how to use AWS services for data streaming and big data, along with some step-by-step demos on building big data solutions using Amazon EMR and Amazon Redshift.
Big Data Streams Architectures. Why? What? How? - Anton Nazaruk
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that meets low-latency Big Data analysis requirements. Apache Kafka and the Kappa Architecture in particular are drawing more and more attention compared with the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams tend to form a synergy in the Big Data world.
Introduces Microsoft’s Data Platform for on-premises and cloud. Covers the challenges businesses are facing with data and sources of data. Explains the evolution of database systems in the modern world, what businesses are doing with their data, and what their new needs are with respect to changing industry landscapes.
Dives into the opportunities available for businesses and industry verticals: the ones which are already identified and the ones which are not yet explored.
Explains Microsoft’s cloud vision and what Microsoft’s Azure platform offers as Infrastructure as a Service or Platform as a Service for building your own offerings.
Introduces and demos some real-world scenarios and case studies where businesses have used the cloud and Azure to create new and innovative solutions that unlock this potential.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be... - Codemotion
A vast volume of our processed data is time-series data, and once you start working with distributed systems, you start tackling many scale and performance problems: How do you handle missing data? Should one system handle both the serving and the backend processes, or should they be separated? How do you get the best performance for the money? In the talk we will tell the tale of all of the transformations we’ve made to our data model at Windward, cover some of the problems we’ve handled, and review multiple data persistency layers: S3, MongoDB, Apache Cassandra, and MySQL. And I’ll try my best NOT to answer the question “Which one of them is the best?”
Anatomy of Data Frame API : A deep dive into Spark Data Frame API - datamantra
In this presentation, we discuss the internals of the Spark DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
This Knolx session was about Spark Structured Streaming. The focus was on the differences between the three APIs (RDD, DataFrame, and Dataset) and on some key concepts of Structured Streaming such as schemas, output modes, and operations like selection, projection, aggregation, and windowing.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... - Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Introduction to Spark Datasets - Functional and relational together at last - Holden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
This presentation aims to be useful by covering the following topics:
- Modern Data Processing System Architectures and Models,
- Batch and Stream Processing Pipelines' details,
- Apache Spark Architecture and Internals,
- Real life use cases used with Apache Spark.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK - zmhassan
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Scala like distributed collections - dumping time-series data with apache spark - Demi Ben-Ari
Spark RDDs are almost identical to Scala collections, just in a distributed manner; all of the transformations and actions are derived from the Scala collections API.
As Martin Odersky mentioned, “Spark - The Ultimate Scala Collections” is the right way to look at RDDs. But with that great distributed power comes a great many data problems: at first you’ll start tackling the concept of partitioning, then the actual data becomes the next thing to worry about.
In the talk we’ll go through an overview on Spark's architecture, and see how similar RDDs are to the Scala collections API. We'll then shift to the world of problems that you’ll be facing when using Spark for processing a vast volume of time-series data with multiple data stores (S3, MongoDB, Apache Cassandra, MySQL).
When you start tackling many scale and performance problems, many questions arise:
> How to handle missing data?
> Should the system handle both serving and backend processes, or should we separate them out?
> Which solution is cheaper?
> How do we get the best performance for money spent?
In the talk we will tell the tale of all of the transformations we’ve made to our data and review the multiple data persistency layers... and I’ll try my best NOT to answer the question “which persistency layer is the best?” but I do promise to share our pains and lessons learned!
An overview of Apache Spark and AWS Glue.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.
In this second part, we'll continue the review of Spark and introduce Spark SQL, which allows you to use data frames in Python, Java, and Scala; read and write data in a variety of structured formats; and query Big Data with SQL.
Immunizing Image Classifiers Against Localized Adversary Attacks - gerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks (CNNs), to adversarial attacks and presents a proactive training technique designed to counter them. We introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations. When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10 and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing accuracy improvements over previous techniques. The results indicate that the combination of the volumetric input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating adversary training.
Student information management system project report ii.pdf - Kamal Acharya
Our project deals with student management. It mainly covers the various actions related to student details. It makes adding, editing and deleting student details easier, and it provides a less time-consuming process for viewing, adding, editing and deleting students' marks.
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE - DuvanRamosGarzon1
AIRCRAFT GENERAL
The Single Aisle is the most advanced family aircraft in service today, with fly-by-wire flight controls.
The A318, A319, A320 and A321 are twin-engine subsonic medium range aircraft.
The family offers a choice of engines
Vaccine management system project report documentation..pdf - Kamal Acharya
The Division of Vaccine and Immunization is facing increasing difficulty monitoring vaccines and other commodities once they have been distributed from the national stores. With the introduction of new vaccines, more challenges are anticipated, with these additions posing a serious threat to the already overstrained vaccine supply chain system in Kenya.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL) - MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL), a government-owned company of Bangladesh Chemical Industries Corporation under the Ministry of Industries.
Courier management system project report.pdf - Kamal Acharya
It is now-a-days very important for the people to send or receive articles like imported furniture, electronic items, gifts, business goods and the like. People depend vastly on different transport systems which mostly use the manual way of receiving and delivering the articles. There is no way to track the articles till they are received and there is no way to let the customer know what happened in transit, once he booked some articles. In such a situation, we need a system which completely computerizes the cargo activities including time to time tracking of the articles sent. This need is fulfilled by Courier Management System software which is online software for the cargo management people that enables them to receive the goods from a source and send them to a required destination and track their status from time to time.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Cosmetic shop management system project report.pdf - Kamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's tough to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. The system includes various function programs to do the above-mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of the general workflow and administration process of the shop. The main processes of the system focus on the customer's request, where the system is able to search for the most appropriate products and deliver them to the customers. It should help the employees quickly identify the cosmetic products that have reached the minimum quantity and also keep track of the expiry date of each cosmetic product. It should help the employees find the rack number in which a product is placed. It is also a faster and more efficient way of working.
Democratizing Fuzzing at Scale by Abhishek Arya - abh.arya
Presented at NUS: Fuzzing and Software Security Summer School 2024
This keynote talks about the democratization of fuzzing at scale, highlighting the collaboration between open source communities, academia, and industry to advance the field of fuzzing. It delves into the history of fuzzing, the development of scalable fuzzing platforms, and the empowerment of community-driven research. The talk will further discuss recent advancements leveraging AI/ML and offer insights into the future evolution of the fuzzing landscape.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two types of water scarcity: physical and economic.
Quality defects in TMT Bars, Possible causes and Potential Solutions. - PrashantGoswami42
Maintaining high-quality standards in the production of TMT bars is crucial for ensuring structural integrity in construction. Addressing common defects through careful monitoring, standardized processes, and advanced technology can significantly improve the quality of TMT bars. Continuous training and adherence to quality control measures will also play a pivotal role in minimizing these defects.
3. Who uses Spark, and for what
Data Science
● Analyze and model the data
● Transforming the data into a usable format
● Ad-hoc analysis, statistics, machine learning
Data Processing
● Parallelize across clusters
● Hides the complexity of distributed systems
○ Programming
○ Networking communication
○ Fault tolerance
10. Core data structures
● Immutable
● Lives in memory
● Strongly typed
● Operations
○ Transformations (lazy)
○ Actions
11. Transformations and Actions
● Transformations return new RDDs as results
They are lazy: their result is not computed immediately
● Actions compute a result based on an RDD, and the result is either
returned to the driver or saved to storage
They are eager: their result is computed immediately
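The lazy/eager split described above can be illustrated without Spark at all. The following is a minimal sketch in plain Python, assuming a hypothetical `LazyCollection` class that stands in for an RDD; it is not Spark's API, only a model of the behaviour: transformations record work and return a new object, and actions run the recorded work.

```python
class LazyCollection:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)    # immutable source data
        self._steps = tuple(steps)  # pending transformations, not yet applied

    # Transformations: return a NEW collection and compute nothing.
    def map(self, fn):
        return LazyCollection(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the source data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a plain value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def double(x):
    calls.append(x)  # side effect so we can observe when work happens
    return x * 2


numbers = LazyCollection(range(5))
doubled = numbers.map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has run yet
result = doubled.collect()        # the action forces evaluation
```

Here `double` is not called until `collect()` runs, which is the property the slide describes as "lazy".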
16-23. What happens when an action is executed?
[Diagram repeated on slides 16 to 23: a Driver above three Workers; each Worker holds one block (Block 1, Block 2, Block 3) and, from slide 20 on, a Cache]
● Slide 17: The data is partitioned into different blocks
● Slide 18: Driver sends the code to be executed on each block
● Slide 19: Read HDFS block
● Slide 20: Read HDFS block and cache the data; process and send the result to the driver
● Slide 21: Driver combines the results / sum
● Slide 22: Process from cache
● Slide 23: Send the data back to the driver
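The sequence in these slides can be modelled in a few lines of plain Python. This is a rough sketch under stated assumptions: the `Worker` class, the `run_action` function and the `storage` dictionary (standing in for HDFS) are all hypothetical names invented for illustration, and everything runs in one process rather than on a cluster.

```python
class Worker:
    """Toy worker that owns one block and caches it after the first read."""

    def __init__(self, block_id, storage):
        self.block_id = block_id
        self._storage = storage      # stands in for HDFS in this sketch
        self._cache = None
        self.reads_from_storage = 0

    def run(self, task):
        # Read the block from storage only if it is not cached yet.
        if self._cache is None:
            self._cache = list(self._storage[self.block_id])
            self.reads_from_storage += 1
        # Process the block and hand the partial result back.
        return task(self._cache)


def run_action(workers, task, combine):
    """Driver side: send the task to every worker, then combine the results."""
    partials = [worker.run(task) for worker in workers]
    return combine(partials)


# The data is partitioned into three blocks, one per worker.
storage = {1: [1, 2, 3], 2: [4, 5], 3: [6]}
workers = [Worker(block_id, storage) for block_id in sorted(storage)]

total = run_action(workers, sum, sum)  # first action: read, cache, process
count = run_action(workers, len, sum)  # second action: processed from cache
```

The second call never touches `storage` again, which mirrors the "Process from cache" step.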
27. Structured API Overview
Structured APIs apply to both batch and streaming computation.
Core types of distributed collections:
● Datasets (typed) - types are checked at compile time
● DataFrames (untyped) - types are checked against the schema at runtime
● SQL tables and views
32. Basic Operations
● Schemas (schema-on-read)
A schema defines the column names and types of a DataFrame
● Columns and Expressions
Columns in Spark are similar to columns in a spreadsheet.
Expressions are operations on columns, such as selecting, manipulating, or removing them
● Records and Rows
Each row is a single record represented as an object of type Row.
33. Data Sources
Read API structure
DataFrameReader.format(...).option("key", "value").schema(...).load()
● CSV
● JSON
● Parquet
● ORC
● JDBC/ODBC connections
● Plain-text files
● and many, many others from community
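To make the shape of that read chain and the idea of schema-on-read concrete, here is a small sketch in plain Python. `ToyReader` is a hypothetical class written for this example and is not Spark's `DataFrameReader`; the option names `sep` and `header` and the schema representation (a list of name and type pairs) are assumptions of the sketch. It only shows how format, options and schema are collected first and applied when `load` is called.

```python
import csv
import io


class ToyReader:
    """Hypothetical fluent reader; NOT Spark's DataFrameReader."""

    def __init__(self):
        self._format = None
        self._options = {}
        self._schema = None

    def format(self, name):
        self._format = name
        return self

    def option(self, key, value):
        self._options[key] = value
        return self

    def schema(self, columns):
        self._schema = columns  # list of (column name, type) pairs
        return self

    def load(self, text):
        if self._format != "csv":
            raise ValueError("this sketch only understands csv")
        rows = csv.reader(io.StringIO(text),
                          delimiter=self._options.get("sep", ","))
        if self._options.get("header") == "true":
            next(rows, None)  # skip the header line
        # Schema-on-read: names and types are applied while reading.
        return [
            {name: cast(cell) for (name, cast), cell in zip(self._schema, row)}
            for row in rows
        ]


raw = "id;price\n1;9.5\n2;12.0\n"
records = (
    ToyReader()
    .format("csv")
    .option("sep", ";")
    .option("header", "true")
    .schema([("id", int), ("price", float)])
    .load(raw)
)
```

The raw text carries no types; they only appear once the schema is applied during `load`.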
49. What is a Key-Value Pair RDD?
● Any RDD whose elements are key-value pairs
○ A key-value pair is a tuple with two components (key, value)
○ Different pairs may have the same key
○ Both keys and values can be of a primitive or complex data type
55. Grouping and Sorting on Pair RDDs
● Grouping values with the same key
○ Reorganizing data by a new key
○ Post-processing per-key groups
● Sorting values using keys
○ Generating special-purpose datasets
○ Generating reports that require ordering
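Grouping and sorting by key can be shown on an ordinary list of tuples. The sketch below uses plain Python rather than Spark; `group_by_key` and `sort_by_key` are hypothetical helper names chosen for this example, and the data is made up.

```python
from collections import defaultdict

# A pair collection: each element is a (key, value) tuple; keys may repeat.
pairs = [("b", 2), ("a", 1), ("b", 5), ("c", 3), ("a", 4)]


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def sort_by_key(pairs):
    """Order the pairs by key; pairs with equal keys keep their input order."""
    return sorted(pairs, key=lambda kv: kv[0])


grouped = group_by_key(pairs)
# Post-processing per-key groups, for example a per-key total.
totals = {key: sum(values) for key, values in grouped.items()}
# Ordered output, as needed for a report.
ordered = sort_by_key(pairs)
```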
60. Broadcast Variables
Main use cases:
● Application tasks across multiple stages need the same, relatively large and immutable dataset
● Application tasks need the same, relatively large and immutable dataset cached in deserialized form
61. Accumulators
Main use cases:
● Counting and summation
● Application needs to compute multiple aggregates on the same dataset
● Application needs custom aggregation not supported by existing Spark operations
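The accumulator use cases listed above share one pattern: each task adds to its own local counters, and the partial values are merged on the driver side. Below is a minimal plain-Python sketch of that pattern; `run_task` and `merge` are hypothetical names, and the sketch does not use Spark's accumulator API.

```python
def run_task(partition):
    """Process one partition and return its local accumulator values."""
    local = {"records": 0, "malformed": 0, "total": 0.0}
    for line in partition:
        local["records"] += 1
        try:
            local["total"] += float(line)
        except ValueError:
            local["malformed"] += 1  # count bad input instead of failing
    return local


def merge(accumulators):
    """Driver side: add up the per-task values into one result."""
    merged = {"records": 0, "malformed": 0, "total": 0.0}
    for acc in accumulators:
        for name, value in acc.items():
            merged[name] += value
    return merged


partitions = [["1.5", "2.5", "oops"], ["4.0"], ["n/a", "2.0"]]
summary = merge(run_task(partition) for partition in partitions)
```

A single pass over the data yields several aggregates at once (record count, malformed count, sum), which is the second use case on the slide.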
69. Distributed Collection of Partitions
● Spark automatically partitions RDDs
● Spark automatically distributes partitions among nodes
70. RDD Partitioning Properties
Number of partitions
Property          Description
partitions        Returns an array with all partition references for the source RDD
partitions.size   Returns the number of partitions in the source RDD
Partitioner
Property          Description
partitioner       Returns an Option[Partitioner] for the source RDD
The Partitioner can be a HashPartitioner, a RangePartitioner, or a custom partitioner
71. Partitioning and Computation
● Partition is the smallest unit of data
● Task is the smallest unit of computation
● Number of partitions = Number of tasks
72. Controlling Partitioning
● Number of partitions
○ Affects the number of tasks and the level of parallelism
○ Goal: balancing task execution and scheduling times
● Partitioner
○ Affects key-based operations
○ Goal: avoiding shuffling the same dataset multiple times
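Why a partitioner matters for key-based operations can be seen in a small hash-partitioning sketch. This is plain Python, not Spark's `HashPartitioner`; `hash_partition` and `partition_by` are hypothetical names, and `zlib.crc32` is an arbitrary choice of a hash that is stable between runs (Python's built-in `hash()` of strings is not).

```python
import zlib


def hash_partition(key, num_partitions):
    """Map a string key to a partition index in [0, num_partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    """Place every (key, value) pair into the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return partitions


pairs = [("user1", 10), ("user2", 7), ("user1", 3), ("user3", 1), ("user2", 4)]
parts = partition_by(pairs, 3)

# For each key, the set of partitions that contain it.
locations = {}
for index, part in enumerate(parts):
    for key, _ in part:
        locations.setdefault(key, set()).add(index)
```

Because all pairs with the same key land in the same partition, a later per-key operation can work partition by partition without moving the data again.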
73. Partitioning Rules
● Parallelizing a Scala Collection
○ partitions size = defaultParallelism
○ partitioner = None
● Reading data from HDFS
○ partitions size = max(number of file blocks, defaultParallelism)
○ partitioner = None
● Retrieving data from Cassandra
○ partitions size = max(data size / 64 MB, defaultParallelism)
○ partitioner = None
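The rules on this slide amount to simple arithmetic, sketched below in plain Python. The function names are hypothetical, and rounding the Cassandra data size up to whole 64 MB splits is an assumption of this sketch, since the slide only gives "data size / 64 MB".

```python
import math


def partitions_for_collection(default_parallelism):
    """Parallelizing a local collection: one partition per default slot."""
    return default_parallelism


def partitions_for_hdfs(num_file_blocks, default_parallelism):
    """Reading from HDFS: the larger of block count and default parallelism."""
    return max(num_file_blocks, default_parallelism)


def partitions_for_cassandra(data_size_mb, default_parallelism, split_size_mb=64):
    """Reading from Cassandra: data size in 64 MB splits (rounded up here,
    which is an assumption), or default parallelism if that is larger."""
    splits = math.ceil(data_size_mb / split_size_mb)
    return max(splits, default_parallelism)


small_file = partitions_for_hdfs(num_file_blocks=3, default_parallelism=8)
large_file = partitions_for_hdfs(num_file_blocks=20, default_parallelism=8)
large_table = partitions_for_cassandra(data_size_mb=1000, default_parallelism=8)
```

In each rule the partitioner stays None; only the number of partitions is determined.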