Running Spark and MapReduce together in Production

David Chaiken, CTO of Altiscale
chaiken@altiscale.com
#HadoopSherpa

AGENDA
• Why run MapReduce and Spark together in production?
• What about H2O, Impala, and other memory-intensive frameworks?
• Batch + Interactive = Challenges
• Specific issues and solutions
• Ongoing Challenges: Keeping Things Running
• Perspective: Hadoop as a Service versus DIY*
* do it yourself

ALTISCALE PERSPECTIVE:
INFRASTRUCTURE NERDS
• Experienced Technical Yahoos
• Raymie Stata, CEO. Former Yahoo! CTO, advocate of the Apache Software Foundation
• David Chaiken, CTO. Former Yahoo! Chief Architect
• Charles Wimmer, Head of Operations. Former Yahoo! SRE
• Hadoop as a Service, built and managed by Big Data, SaaS, and enterprise software veterans
• Yahoo!, Google, LinkedIn, VMware, Oracle, ...

SOLVED:
COST-EFFECTIVE DATA SCIENCE AT SCALE
But how do you make it easier for data scientists?
Two bad options:
1. Use Hadoop directly, with unfamiliar and unproductive command-line tools and APIs
2. Use Hadoop indirectly, via a back-and-forth with data engineers who translate needs into Hadoop programs

COMMON HADOOP WORKFLOW
[Figure: Data Scientist’s Workflow. Labels: Source Data, CSV, Hive, Cleansing, Flattening, Exploration, Modeling, Serving, Production]

ENTER SPARK. . . AND IMPALA AND H2O
• Interactive, iterative analysis
• Quick turns
• Memory heavy

DOES THIS MEAN THAT MAPREDUCE
DOESN’T MATTER ANYMORE?
HA!
(Don’t believe the hype.)

BIG DATA MODELING WORKFLOW
IT MATTERS SO MUCH THAT YOU WANT BOTH ON ONE CLUSTER.
[Figure: workflow diagram. Labels: Source Data, CSV, Hive, Cleansing, Flattening, Exploration, Modeling, Serving, Production]

THE CHALLENGE. . .
“Why is my Spark job not starting?”
“Why is my Spark job consuming so many resources?”
Resource conflicts!

SPECIFIC ISSUES
AND SOLUTIONS

INTERACTIVE:
INCREASE CONTAINER SIZE
Challenge: Memory-intensive systems take as much local DRAM as is available.
Solutions:
• Spark and H2O: Increase YARN container memory size
• Impala: Box it in using operating system containers
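
To make "increase YARN container memory size" concrete, here is a rough sketch of how a Spark executor's heap turns into a YARN container request. The overhead fraction, the overhead floor, and the allocation increment are assumptions chosen for illustration; the real values depend on the Spark version and on the scheduler configuration (for example yarn.scheduler.minimum-allocation-mb), so check them against your cluster.

```python
import math


def yarn_container_mb(executor_memory_mb, overhead_fraction=0.10,
                      min_overhead_mb=384, allocation_increment_mb=1024):
    """Estimate the YARN container size requested for one Spark executor.

    All default values are illustrative assumptions, not authoritative
    Spark or YARN defaults.
    """
    # Off-heap overhead is added on top of the executor heap.
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    requested_mb = executor_memory_mb + overhead_mb
    # The scheduler rounds each request up to a multiple of its increment.
    return math.ceil(requested_mb / allocation_increment_mb) * allocation_increment_mb


# An 8 GB executor heap ends up needing a larger container than 8 GB.
container_mb = yarn_container_mb(8192)
```

The point of the sketch is that the container has to be sized for heap plus overhead, which is one reason interactive frameworks push container limits upward.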

HIVE + INTERACTIVE:
WATCH OUT FOR LARGE CONTAINER SIZE
• Caution: Larger YARN container settings for interactive jobs may not be right for batch systems like Hive
• Container size needs to combine vcores and memory:
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
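
A small sketch of why container size has to combine vcores and memory: the number of containers a node can host is bounded by whichever resource runs out first. The node and container sizes below are hypothetical, and the calculation assumes the scheduler accounts for both resources (in YARN this depends on the configured resource calculator).

```python
def containers_per_node(node_mb, node_vcores, container_mb, container_vcores):
    """Return how many identical containers fit on one node."""
    by_memory = node_mb // container_mb
    by_vcores = node_vcores // container_vcores
    # The scarcer resource decides.
    return min(by_memory, by_vcores)


# Hypothetical node offering 96 GB and 16 vcores to YARN.
NODE_MB, NODE_VCORES = 96 * 1024, 16

# Small batch containers: memory would allow 48, but vcores cap it.
batch_fit = containers_per_node(NODE_MB, NODE_VCORES, 2048, 1)

# Large interactive containers: 24 GB and 4 vcores each.
interactive_fit = containers_per_node(NODE_MB, NODE_VCORES, 24 * 1024, 4)
```

With these assumed numbers, tuning memory alone would overstate batch capacity, which is the mismatch the caution above is about.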

HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
• Caution: Attempting to schedule interactive systems and
batch systems like Hive may result in fragmentation
• Interactive systems may require all-or-nothing scheduling
• Batch jobs with small tasks may starve interactive jobs
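
The fragmentation problem can be illustrated with a toy model (not YARN's actual scheduler): small batch tasks spread across every node leave plenty of free memory in total, yet no single node can host one large interactive container. All sizes are hypothetical.

```python
def fill_with_batch(node_mb, nodes, batch_mb, batch_tasks):
    """Spread batch tasks round-robin over nodes; return free MB per node."""
    free = [node_mb] * nodes
    for i in range(batch_tasks):
        node = i % nodes
        if free[node] >= batch_mb:
            free[node] -= batch_mb
    return free


def can_place(free_mb_per_node, container_mb):
    """A container must fit entirely on a single node."""
    return any(free >= container_mb for free in free_mb_per_node)


# Four 64 GB nodes, 100 batch tasks of 2 GB each.
free = fill_with_batch(node_mb=65536, nodes=4, batch_mb=2048, batch_tasks=100)
total_free = sum(free)

# A 32 GB interactive container cannot start even though the cluster
# as a whole has more than 32 GB free.
interactive_ok = can_place(free, 32768)
```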

HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
Solutions:
• Reserve interactive nodes before starting batch jobs
• Reduce interactive container size (if the algorithm permits)
• Node labels (YARN-2492) and gang scheduling (YARN-624)
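
A minimal sketch of all-or-nothing (gang) placement, written as a toy first-fit allocator rather than as YARN's implementation. The function name and the node sizes are hypothetical. It shows why reserving nodes before batch jobs start helps: the same request fails on a fragmented cluster and succeeds on reserved nodes.

```python
def gang_allocate(free_mb_per_node, container_mb, count):
    """Place `count` containers or none at all.

    Returns (new_free_list, placement) on success, or None if the whole
    gang cannot be placed. The caller's list is never modified.
    """
    free = list(free_mb_per_node)
    placement = []
    for _ in range(count):
        for node, available in enumerate(free):
            if available >= container_mb:
                free[node] -= container_mb
                placement.append(node)
                break
        else:
            # One container could not be placed, so reject the whole gang.
            return None
    return free, placement


# Fragmented cluster: batch tasks already took most of every node.
fragmented = gang_allocate([14336] * 4, 32768, 2)

# Nodes reserved for interactive work before batch jobs started.
reserved = gang_allocate([65536] * 4, 32768, 2)
```

Reducing the interactive container size, where the algorithm permits, attacks the same problem from the other side: smaller containers fit into the gaps that batch tasks leave behind.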

ONGOING
CHALLENGES
Keeping things running. . .

CHALLENGE: SECURITY
• Challenge: User Management not uniform
• MapReduce: collaboration requires getting groups right
• Hive: proxyuser settings have to be right for hiveserver2
• Spark application owner versus connected users
• Impala: “I just gotta be me!”
• As usual, watch out for cluster administrator accounts!
• Challenge: Port and Protocol Management
• Best security practice: open specific ports for specific protocols
• Spark: “I just gotta be free!”
• Spark improved between versions 1.0.2 and 1.1.0, but is still confusing
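
One way to follow the "open specific ports" practice with Spark is to pin its network ports instead of letting it pick random ones. The sketch below generates spark-defaults.conf lines. The property names are the port settings as documented for the Spark 1.1 era, to the best of my recollection; several were renamed or removed in later releases, so treat the list as an assumption and verify it against your version. The base port is hypothetical.

```python
# Port-related Spark properties (assumed from Spark 1.1-era documentation).
SPARK_PORT_PROPERTIES = [
    "spark.driver.port",
    "spark.fileserver.port",
    "spark.broadcast.port",
    "spark.replClassServer.port",
    "spark.blockManager.port",
    "spark.executor.port",
]


def pinned_ports(base_port):
    """Assign consecutive ports, starting at base_port, to each property."""
    return {name: base_port + i for i, name in enumerate(SPARK_PORT_PROPERTIES)}


def to_spark_defaults(ports):
    """Render the mapping as whitespace-separated spark-defaults.conf lines."""
    return "\n".join(f"{name} {port}" for name, port in ports.items())


ports = pinned_ports(40000)  # hypothetical base port
conf_text = to_spark_defaults(ports)
```

With the ports fixed, a firewall rule can cover a known range rather than every ephemeral port.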

CHALLENGE: WEB SERVING
• How to provide interactive services to business users?
• Concerns: security, variable resources, latency, availability
• Keep serving infrastructure separate from Hadoop

CHALLENGE:
RESOURCE ATTRIBUTION (BILLING)
• Accounting for long-running Spark, H2O, Impala clusters?
• Is reserving resources the same as using the resources?
• Trade-off: availability/response time vs. oversubscription.
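
The gap between reserving and using resources can be put into numbers with a simple memory-seconds calculation. This is only a sketch of one possible accounting model; the reservation size, the usage samples, and the sampling interval are all made up.

```python
def mb_seconds(samples_mb, interval_s):
    """Integrate memory samples taken every interval_s seconds."""
    return sum(samples_mb) * interval_s


# Hypothetical long-running interactive cluster holding 16 GB,
# sampled once a minute for four minutes.
reserved_mb = 16384
used_mb_samples = [2048, 12288, 16384, 1024]
interval_s = 60

reserved_total = mb_seconds([reserved_mb] * len(used_mb_samples), interval_s)
used_total = mb_seconds(used_mb_samples, interval_s)

# Fraction of the reservation that actually did work.
utilization = used_total / reserved_total
```

Billing on the reserved total favors predictability, while billing on the used total rewards idle long-running clusters, which is the trade-off the questions above raise.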

CHALLENGE:
STABILITY VERSUS AGILITY
• Never-ending story: latest hotness versus SLAs*
• New system stability curve. Example…
• SPARK-1476: 2GB limit in Spark for blocks
• Interoperation issues. Example…
• IMPALA-1416: Queries fail with metastore exception after upgrade and compute stats
• HIVE-8627: Compute stats on a table from Impala caused the table to be corrupted
• Many issues come down to YARN container size and
JVM heap size configuration
* service level agreements
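
Since many issues come down to YARN container size versus JVM heap size, a quick sanity check is to compare the -Xmx value in a task's Java options (for example mapreduce.map.java.opts) with its container size (for example mapreduce.map.memory.mb). The sketch below assumes a rule of thumb that the heap should stay at or below 80% of the container; that fraction is an assumption, not a fixed rule.

```python
import re


def parse_xmx_mb(java_opts):
    """Return the -Xmx value in MB, or None if no -Xmx flag is present."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    # A bare number means bytes; convert every unit to MB.
    factor = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}[unit]
    return value * factor


def heap_fits(container_mb, java_opts, max_heap_fraction=0.8):
    """Check that the JVM heap leaves headroom inside the container."""
    heap_mb = parse_xmx_mb(java_opts)
    return heap_mb is not None and heap_mb <= container_mb * max_heap_fraction


ok = heap_fits(2048, "-Xmx1638m")
too_big = heap_fits(2048, "-Xmx2g -XX:+UseG1GC")
```

A heap equal to the container size leaves no room for non-heap JVM memory, so the container is likely to be killed for exceeding its limit.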

PERSPECTIVE: HADOOP AS A SERVICE
VERSUS DIY (DO IT YOURSELF)
• Data Scientists and Data Engineers: use the right tools for the right job
• Data Scientists and Data Engineers: don’t spend your time on cluster maintenance
• Hadoop As A Service: have your cake and eat it, too
• Benefit from the experiences of other customers
• One size does not fit all, but one configuration schema does
• Leave the maintenance to us infrastructure nerds
