The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. As the amount of data collected and analyzed in enterprises has increased severalfold in volume, variety and velocity of generation and consumption, organisations have started struggling with the architectural limitations of traditional RDBMS architecture. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of "Big Data". In this paper we trace the origin of a new class of system, called Hadoop, designed to handle Big Data.
In this session you will learn:
1. History of Hadoop
2. Hadoop Ecosystem
3. Hadoop Animal Planet
4. What is Hadoop?
5. Distinctions of Hadoop
6. Hadoop Components
7. The Hadoop Distributed Filesystem
8. Design of HDFS
9. When Not to use Hadoop?
10. HDFS Concepts
11. Anatomy of a File Read
12. Anatomy of a File Write
13. Replication & Rack awareness
14. MapReduce Components
15. Typical MapReduce Job
In this session you will learn:
1. Meet MapReduce
2. Word Count Algorithm – Traditional approach
3. Traditional approach on a Distributed System
4. Traditional approach – Drawbacks
5. MapReduce Approach
6. Input & Output Forms of a MR program
7. Map, Shuffle & Sort, Reduce Phase
8. WordCount Code walkthrough
9. Workflow & Transformation of Data
10. Input Split & HDFS Block
11. Relation between Split & Block
12. Data locality Optimization
13. Speculative Execution
14. MR Flow with Single Reduce Task
15. MR flow with multiple Reducers
16. Input Format & Hierarchy
17. Output Format & Hierarchy
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi... (Cognizant)
A guide to using Apache Hadoop as your open source big data platform of choice, including the vendors that make various Hadoop flavors, related open source tools, Hadoop capabilities and suitable applications.
These slides cover the very basics of Hadoop architecture, in particular HDFS. This was my presentation at the first Delhi Hadoop User Group (DHUG) meetup, held at Gurgaon on 10th September 2011. Loved the positive feedback. I'll also upload a more elaborate version covering Hadoop MapReduce architecture soon. Most of the material covered in these slides can be found in Tom White's book as well (see the last slide).
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
This is an updated version of Amr's Hadoop presentation. Amr gave this talk recently at NASA CIDU event, TDWI LA Chapter, and also Netflix HQ. You should watch the powerpoint version as it has animations. The slides also include handout notes with additional information.
Asserting that Big Data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly Big Data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones, tablets, etc.), social Web sites (e.g., Facebook, Twitter, LinkedIn, etc.) and other sources like GPS, Google Maps, heat/pressure sensors, etc.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Presentation detailing the capabilities of in-memory analytics using Apache Spark: an Apache Spark overview with its programming model, cluster mode with Mesos, supported operations and a comparison with Hadoop MapReduce, elaborating on the Apache Spark stack: Shark, Streaming, MLlib, GraphX.
Unified Big Data Processing with Apache Spark (QCON 2014) (Databricks)
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
1. Big Data Analytics
- Big Data
- Spark: Big Data Analytics
- Resilient Distributed Datasets (RDD)
- Spark libraries (SQL, DataFrames, MLlib for machine learning, GraphX, and Streaming)
- PFP: Parallel FP-Growth
2. Ubiquitous Computing
- Edge Computing
- Cloudlet
- Fog computing
- Internet of Things (IoT)
- Virtualization
- Virtual Conferencing
- Virtual Events (2D, 3D, and Hybrid)
The critical thing to remember about Spark and Hadoop is that they are not mutually exclusive: they work well together, and the combination is strong enough for lots of big data applications.
In this era of ever-growing data, the need for analyzing it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm etc. Spark, however, is unique in providing batch as well as streaming capabilities, thus making it a preferred choice for lightning-fast Big Data analysis platforms.
Hadoop MapReduce is increasingly being replaced by Spark (which is written in Scala). A basic reason is that Spark can run certain in-memory workloads up to 100 times faster than Hadoop MapReduce, so tasks performed on Spark are much faster and more efficient.
2. Session Objective
Introduction, Quick Recap: Spark
Machine Learning (on Spark)
Spark: Introduction, Architecture and Features
Spark: Resilient Distributed Datasets, Transformations, Examples
Regression, Classification, Collaborative Filtering, & Clustering
SUMMARY
3. Shortcomings of MapReduce
Forces your data processing into Map and Reduce
Other workflows are missing, e.g. join, filter, flatMap, groupByKey, union, intersection, …
Based on "Acyclic Data Flow" from disk to disk (HDFS)
Reads and writes to disk before and after Map and Reduce (stateless machine)
Not efficient for iterative tasks, e.g. Machine Learning
Only Java natively supported
Support for other languages needed
Only for batch processing
Interactivity and streaming data are missing
Introduction
4. One Solution is Apache Spark
A new general framework which solves many of the shortcomings of MapReduce
Capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, …
Has many other workflows, i.e. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first, … (around 30 efficient distributed operations)
In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
Native Scala, Java, Python, and R support
Supports interactive shells for exploratory data analysis
The Spark API is extremely simple to use
Developed at AMPLab UC Berkeley, now by Databricks.com
5. Spark Ideas
An expressive computing system, not limited to the map-reduce model
Facilitates system memory:
avoids saving intermediate results to disk
caches data for repetitive queries (e.g. for machine learning)
Compatible with Hadoop
6. Spark Uses Memory instead of Disk
[Figure: in Hadoop, each iteration reads its input from HDFS and writes its output back to HDFS on disk; in Spark, data is shared in memory between Iteration 1 and Iteration 2, with only the initial HDFS read.]
Hadoop: Use Disk for Data Sharing
Spark: In-Memory Data Sharing
Spark: Introduction
7. Sort Competition

                          Hadoop MR Record (2013)        Spark Record (2014)
Data Size                 102.5 TB                       100 TB
Elapsed Time              72 mins                        23 mins
# Nodes                   2100                           206
# Cores                   50400 physical                 6592 virtualized
Cluster disk throughput   3150 GB/s (est.)               618 GB/s
Network                   dedicated data center, 10Gbps  virtualized (EC2) 10Gbps network
Sort rate                 1.42 TB/min                    4.27 TB/min
Sort rate/node            0.67 GB/min                    20.7 GB/min

Spark: 3x faster with 1/10 the nodes.
Sort benchmark, Daytona Gray: sort of 100 TB of data (1 trillion records), http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
8. Apache Spark supports data analysis, machine learning, graphs, and streaming data, etc. It can read/write from a range of data types and allows development in multiple languages.
Languages: Scala, Java, Python, R, SQL
Stack: Spark Core; Spark Streaming; MLlib; GraphX; ML Pipelines; Spark SQL; DataFrames
Data Sources: Hadoop HDFS, HBase, Hive, Amazon S3, Streaming, JSON, MySQL, and HPC-style (GlusterFS, Lustre)
12. Spark's Main Use Cases
Streaming Data
Machine Learning
Interactive Analysis
Data Warehousing
Batch Processing
Exploratory Data Analysis
Graph Data Analysis
Spatial (GIS) Data Analysis
And many more
13. Fingerprint Matching
Developed a Spark-based fingerprint minutia detection and fingerprint matching code
Twitter Sentiment Analysis
Developed a Spark-based Sentiment Analysis code for a Twitter dataset
14. Uber – the online taxi company gathers terabytes of event data from its mobile users every day
Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline
Converts raw unstructured event data into structured data as it is collected
Uses it further for more complex analytics and optimization of operations
Pinterest – uses a Spark ETL pipeline
Leverages Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins, in real time
Can make more relevant recommendations as people navigate the site
Recommends related Pins
Determines which products to buy, or destinations to visit
15. Here are a few other real-world use cases:
Conviva – 4 million video feeds per month
This streaming video company is second only to YouTube
Uses Spark to reduce customer churn by optimizing video streams and managing live video traffic
Maintains a consistently smooth, high-quality viewing experience
Capital One – is using Spark and data science algorithms to understand customers in a better way
Developing the next generation of financial products and services
Finding attributes and patterns of increased probability for fraud
Netflix – leverages Spark for insights into user viewing habits and then recommends movies to them
User data is also used for content creation
16. Spark: When Not to Use
Even though Spark is versatile, that doesn't mean Spark's in-memory capabilities are the best fit for all use cases:
For many simple use cases, Apache MapReduce and Hive might be a more appropriate choice
Spark was not designed as a multi-user environment
Spark users are required to know whether the memory they have is sufficient for a dataset
Adding more users adds complications, since the users will have to coordinate memory usage to run code
17. HPC and Big Data Convergence
Clouds and supercomputers are collections of computers networked together in a datacenter
Clouds have different networking, I/O, CPU and cost trade-offs than supercomputers
Cloud workloads are data-oriented vs. computation-oriented, and are less closely coupled than supercomputers
Principles of parallel computing are the same on both
Apache Hadoop and Spark vs. Open MPI
18. RDDs (Resilient Distributed Datasets) are Data Containers
All the different processing components in Spark share the same abstraction, called RDD
As applications share the RDD abstraction, you can mix different kinds of transformations to create new RDDs
Created by parallelizing a collection or reading a file
Fault tolerant
Resilient Distributed Datasets (RDDs)
19. RDD Abstraction
Resilient Distributed Datasets:
a partitioned collection of records
spread across the cluster
read-only
caching of the dataset in memory
different storage levels available
fallback to disk possible
20. RDD Operations
Transformations build RDDs through deterministic operations on other RDDs
transformations include map, filter and join
transformations are lazy operations
Actions return a value or export data
actions include count, collect and save
actions trigger execution
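The lazy-transformation vs. eager-action distinction above can be illustrated outside Spark with plain-Python generators (a concept sketch only, not the Spark API):

```python
# Transformations: build a lazy pipeline; nothing is computed yet.
numbers = range(10)                            # source dataset
evens = (x for x in numbers if x % 2 == 0)     # like rdd.filter(...) - lazy
squares = (x * x for x in evens)               # like rdd.map(...) - lazy

# Action: materializing the result finally runs the whole pipeline,
# the way count()/collect() trigger execution in Spark.
result = list(squares)
print(result)  # [0, 4, 16, 36, 64]
```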
21. Job Example

val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
errors.cache()
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()

[Figure: the driver program distributes the job to three workers; each worker reads an HDFS block (Block1-3) and caches the filtered errors (Cache1-3). The count() calls are the actions that trigger execution.]
23. Job Scheduling

rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)

RDD Objects: build the operator DAG
DAGScheduler: splits the graph into stages of tasks; submits each stage as it becomes ready
TaskScheduler: launches tasks (TaskSets) via the cluster manager; retries failed or straggling tasks
Worker: executes tasks in threads; the block manager stores and serves blocks
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
24. Available APIs
You can write in Java, Scala or Python
interactive interpreter: Scala & Python only
standalone applications: any
performance: Java & Scala are faster thanks to static typing
25. Summary
The concept is not limited to single-pass map-reduce
Avoid storing intermediate results on disk or HDFS
Speed up computations when reusing datasets
27. DataFrames & SparkSQL
DataFrames (DFs) are another kind of distributed dataset, organized into named columns
Similar to a relational database table, a Python Pandas DataFrame, or R's DataTables
Immutable once constructed
Track lineage
Enable distributed computations
How to construct DataFrames:
read from file(s)
transform an existing DF (Spark or Pandas)
parallelize a Python collection list
Apply transformations and actions
28. DataFrame Example

# Create a new DataFrame that contains "students"
students = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
students = users[users.age < 21]
# Count the number of student users by gender
students.groupBy("gender").count()
# Join young students with another DataFrame called logs
students.join(logs, logs.userId == users.userId, "left_outer")
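Since the slide notes that Spark DataFrames resemble pandas DataFrames, the same filter and group-by count can be run locally in pandas (an illustrative analogue with made-up sample data, not the Spark API):

```python
import pandas as pd

# Hypothetical sample data standing in for the "users" DataFrame above.
users = pd.DataFrame({
    "userId": [1, 2, 3, 4],
    "age":    [19, 25, 20, 30],
    "gender": ["F", "M", "M", "F"],
})

# Pandas-style filter, analogous to users.filter(users.age < 21) in Spark.
students = users[users.age < 21]

# Count the number of student users by gender,
# analogous to students.groupBy("gender").count() in Spark.
counts = students.groupby("gender").size()
print(counts.to_dict())  # {'F': 1, 'M': 1}
```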
29. RDDs vs. DataFrames
RDDs provide a low-level interface into Spark
DataFrames have a schema
DataFrames are cached and optimized by Spark
DataFrames are built on top of the RDDs and the core Spark API
Example: performance
30. Spark Operations
Transformations (create a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection, flatMap, union, join, cogroup, cross, mapValues
Actions (return results to the driver program): collect, first, reduce, take, count, takeOrdered, takeSample, countByKey, save, lookupKey, foreach
33. Spark Workflow
[Figure: a driver program's SparkContext coordinates a workflow of flatMap, map, and groupByKey transformations, ending with a collect back to the driver.]
Narrow vs. Wide Transformation
34. What is an Action?
The final stage of the workflow
Triggers the execution of the DAG
Returns the results to the driver
Or writes the data to HDFS or to a file
35. Word Count

text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
counts = (text_file.flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")

Examples from http://spark.apache.org/
RDD API Examples
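The same flatMap → map → reduceByKey dataflow can be mimicked in plain Python (a local illustration of what the Spark job above computes, not the Spark API):

```python
from collections import defaultdict

lines = ["to be or not to be"]  # stands in for the lines of book.txt

# flatMap: split every line into words.
words = [w for line in lines for w in line.split(" ")]
# map: pair each word with a count of 1.
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts for each distinct word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```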
Logistic Regression
# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
37. Broadcast Variables and Accumulators (Shared Variables)
Broadcast variables allow the programmer to keep a read-only variable cached on each node, rather than sending a copy of it with tasks:

>>> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
>>> broadcastV1.value
[1, 2, 3, 4, 5, 6]

Accumulators are variables that are only "added" to through an associative operation and can be efficiently supported in parallel:

>>> accum = sc.accumulator(0)
>>> accum.add(x)
>>> accum.value
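As a concept illustration (plain Python, not the Spark API), an add-only accumulator can be sketched as a tiny class:

```python
class Accumulator:
    """Add-only counter, mimicking the idea behind sc.accumulator(0)."""
    def __init__(self, initial=0):
        self.value = initial

    def add(self, x):
        # Only an associative "add" is exposed; there is no general mutation,
        # which is what lets Spark merge updates from parallel tasks safely.
        self.value += x

accum = Accumulator(0)
for x in [1, 2, 3, 4]:
    accum.add(x)
print(accum.value)  # 10
```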
RDD Persistence and Removal
39. Regression
Logistic Regression is a popular supervised machine learning algorithm which can be used to predict a categorical response. It can be used to solve classification-type machine learning problems.
Classification involves looking at data and assigning a class (or a label) to it. Usually there is more than one class; when there are two classes (0 or 1) it is known as Binary Classification.
The Logistic Regression algorithm is used to build a machine learning model with Apache Spark.
The Logistic Regression model builds a Binary Classifier to predict student exam pass/fail results based on past exam scores.
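The model described above is built with Spark MLlib; as a small, locally runnable stand-in, the same pass/fail idea in scikit-learn (the scores and labels below are made-up illustrative data):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical past exam scores (feature) and pass/fail labels (1 = pass).
scores = [[35], [42], [50], [55], [61], [70], [78], [90]]
passed = [0, 0, 0, 1, 1, 1, 1, 1]

# Fit a binary classifier, analogous to LogisticRegression(...).fit(df)
# in the Spark MLlib example earlier in the deck.
model = LogisticRegression().fit(scores, passed)

# A very low score should predict fail (0), a very high one pass (1).
preds = model.predict([[30], [95]])
print(list(preds))
```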
41. π = Proportion of "Success"
In ordinary regression the model predicts the mean Y for any combination of predictors.
What's the "mean" of a 0/1 indicator variable?

ȳ = (Σ yᵢ) / n = (# of 1's) / (# of trials) = proportion of "success"

Goal of logistic regression: predict the "true" proportion of success, π, at any value of the predictor.
Regression
42. Binary Logistic Regression Model
Y = binary response, X = quantitative predictor
π = proportion of 1's (yes, success) at any X

Equivalent forms of the logistic regression model:
Probability form:  π = e^(β0 + β1·X) / (1 + e^(β0 + β1·X))
Logit form:  log( π / (1 − π) ) = β0 + β1·X

What does this look like?
N.B.: This is the natural log (aka "ln")
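A quick numeric check (plain Python, with illustrative coefficient values β0 = −1, β1 = 0.5) confirms the two forms are equivalent: applying the probability form and then taking the logit recovers β0 + β1·X:

```python
import math

b0, b1 = -1.0, 0.5   # illustrative coefficients, not from any real fit
x = 3.0

# Probability form: pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
linear = b0 + b1 * x
pi = math.exp(linear) / (1 + math.exp(linear))

# Logit form: log(pi / (1 - pi)) should equal b0 + b1*x
logit = math.log(pi / (1 - pi))

print(round(pi, 4), round(logit, 4))  # 0.6225 0.5
```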
43. Logit Function
[Figure: function plot (no data) of y = exp(b0 + b1·x) / (1 + exp(b0 + b1·x)) for x from −10 to 12, y from 0 to 1: the S-shaped logistic curve.]