This presentation outlines the evolution of the monitoring data pipeline at Rackspace and explores the compute and data management challenges we have faced at this scale. We focus on our use of Hadoop and Hive as data storage and transformation platforms while discussing the technology stack, key architectural decisions, and the observations and pitfalls encountered in building the pipeline.
1. Faster Data Flows with Hive, Spring and Hadoop
Alex Silva
Principal Data Engineer
2. DATA ADVENTURES AT RACKSPACE
• Datasets
• Data pipeline: flows and systems
• Creating a generic Hadoop ETL framework
• Integrating Hadoop with Spring
• Spring Hadoop, Spring Batch and Spring Boot
• Hive
• File formats
• Queries and performance
3. MAAS Dataset
• System and platform monitoring
• Pings, SSH, HTTP, HTTPS checks
• Remote monitoring
• CPU, file system, load average, disk, memory
• MySQL, Apache
THE BUSINESS DOMAIN | 3
4. The Dataset
• Processing around 1.5B records/day
• Stored in Cassandra
• Exported to HDFS in batches
• TBs of uncompressed JSON (“raw data”) daily
• First dataset piped through ETL platform
DATA ENGINEERING STATS | 4
5. DATA PIPELINE
• Data flow
• Stages
• ETL
• Input formats
• Generic Transformation Layer
• Outputs
6. Data Flow Diagram
DATA FLOW | 6
[Flowchart: monitoring data is exported as JSON into HDFS. If the export is available and well-formed, the extract-and-transform ETL stage runs; otherwise the flow stops. Bad rows and errors are logged through Flume. Clean records are written to a CSV staging file in HDFS, and the load stage moves data from a staging table to the production Hive table, applying partitioning, bucketing, and indexing.]
7. Systems Diagram
SYSTEMS | 7
[Diagram: monitoring events are exported as JSON into HDFS; MapReduce (1.2.0.1.3.2.0) extracts and transforms them; results are loaded into Hive 0.12.0 for end-user access; bad records are routed to a sink through the Flume Log4j appender (Flume 1.5.0).]
8. ETL Summary
• Extract
• JSON files in HDFS
• Transform
• Generic Java-based ETL framework
• MapReduce jobs extract features
• Quality checks
• Load
• Load data into partitioned ORC Hive tables
DATA FLOW | 8
10. Hadoop: Pros
• Dataset volume
• Data volume grows at a very rapid rate
• Integrates with existing ecosystem
• HiveQL
• Experimentation and exploration
• No expensive software or hardware to buy
TOOLS AND TECHNOLOGIES | 10
11. Hadoop: Cons
• Job monitoring and scheduling
• Data quality
• Error handling and notification
• Programming model
• Our generic framework mitigates some of this
TOOLS AND TECHNOLOGIES | 11
13. Keeping the Elephant “Lean”
• Job control without the complexity of external tools
• Checks and validations
• Unified configuration model
• Integration with scripts
• Automation
• Job restartability
DATA ENGINEERING | 13
16. What is it about?
• Part of the Spring Framework
• Run Hadoop apps as standard Java apps using DI
• Unified declarative configuration model
• APIs to run MapReduce, Hive, and Pig jobs.
• Script HDFS operations using any JVM-based language.
• Supports both classic MR and YARN
TOOLS AND TECHNOLOGIES | 16
17. The Apache Hadoop Namespace
TOOLS AND TECHNOLOGIES | 17
Also supports annotation-based configuration via the @EnableHadoop annotation.
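For reference, the Spring for Apache Hadoop XML namespace is declared like this:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
    <!-- hdp:* elements such as hdp:configuration and hdp:job go here -->
</beans>

And a minimal sketch of the annotation-based alternative (class name and file system URI are illustrative):

@Configuration
@EnableHadoop
public class HadoopConfig extends SpringHadoopConfigurerAdapter {
    @Override
    public void configure(HadoopConfigConfigurer config) throws Exception {
        // equivalent of setting fs.default.name in the XML configuration
        config.fileSystemUri("hdfs://localhost:9000");
    }
}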
18. Job Configuration: Standard Hadoop APIs
TOOLS AND TECHNOLOGIES | 18
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
19. Configuring Hadoop with Spring
SPRING HADOOP | 19
<context:property-placeholder location="hadoop-dev.properties"/>
<hdp:configuration>
fs.default.name=${hd.fs}
</hdp:configuration>
<hdp:job id="word-count-job"
    input-path="${input.path}"
    output-path="${output.path}"
    jar="hadoop-examples.jar"
    mapper="examples.WordCount.WordMapper"
    reducer="examples.WordCount.IntSumReducer"/>
<hdp:job-runner id="runner" job-ref="word-count-job"
    run-at-startup="true"/>

hadoop-dev.properties:
input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000
22. Injecting Jobs
• Use DI to obtain a reference to a Spring-managed Hadoop job
• Perform additional validation and configuration before submitting
TOOLS AND TECHNOLOGIES | 22
public class WordService {

  @Autowired
  private Job mapReduceJob;

  public void processWords() {
    mapReduceJob.submit();
  }
}
27. Scripting Support in HDFS
• FSShell is designed to support scripting languages
• Use these for housekeeping tasks:
• Check for files, prepare input data, clean output
directories, set flags, etc.
TOOLS AND TECHNOLOGIES | 27
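A minimal housekeeping sketch using Spring Hadoop's FsShell from Java (the paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.springframework.data.hadoop.fs.FsShell;

public class Housekeeping {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FsShell fsh = new FsShell(conf);

        // clean the output directory so the job can be re-run
        if (fsh.test("/etl/output")) {
            fsh.rmr("/etl/output");
        }
        // make sure the input directory exists before submitting
        if (!fsh.test("/etl/input")) {
            fsh.mkdir("/etl/input");
        }
    }
}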
29. What is it about?
• Born out of collaboration with Accenture in 2007
• Fully automated processing of large volumes of data.
• Logging, transaction management, listeners, job statistics, restart, skipping, and resource management
• Automatic retries after failure
• Sync, async and parallel processing
• Data partitioning
TOOLS AND TECHNOLOGIES | 29
30. Hadoop Workflow Orchestration
• Complex data flows
• Reuses batch infrastructure to manage Hadoop workflows.
• Steps can be any Hadoop job type or HDFS script
• Jobs can be invoked by events or scheduled.
• Steps can be sequential, conditional, split, concurrent, or programmatically determined.
• Works with flat files, XML, or databases.
TOOLS AND TECHNOLOGIES | 30
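As a hedged sketch, such a workflow is simply a Spring Batch job whose steps wrap the Hadoop tasklets defined later in this deck (step names are illustrative):

<batch:job id="etlJob">
    <batch:step id="extractTransform" next="loadHive">
        <batch:tasklet ref="metricsJobTasklet"/>
    </batch:step>
    <batch:step id="loadHive">
        <batch:tasklet ref="load-metrics"/>
    </batch:step>
</batch:job>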
34. What is it about?
• Builds production-ready Spring applications.
• Creates a “runnable” jar with dependencies and classpath settings.
• Can embed Tomcat or Jetty within the JAR
• Automatic configuration
• Out of the box features:
• statistics, metrics, health checks and externalized configuration
• No code generation and no requirement for XML configuration.
TOOLS AND TECHNOLOGIES | 34
36. Spring Data Flow Components
TOOLS AND TECHNOLOGIES | 36
[Diagram: Spring Boot packages the application; Spring Batch 2.0 drives the extract and load steps; Spring Hadoop 2.01.1.5 connects them to HDFS, Hive 0.12.0, and MapReduce on HDP 1.3.]
37. Hierarchical View
TOOLS AND TECHNOLOGIES | 37
[Diagram: a layered stack with Spring Boot on top, Spring Batch beneath it providing job control (notifications, validation, scheduling, data flow, callbacks), and Spring Hadoop at the base.]
39. Spring Hadoop Configuration
• Job parameters configured by Spring
• Sensible defaults used
• Parameters can be overridden:
• External properties file.
• At runtime via system properties: -Dproperty.name=property.value
TOOLS AND TECHNOLOGIES | 39
<configuration>
fs.default.name=${hd.fs}
io.sort.mb=${io.sort.mb:640}
mapred.reduce.tasks=${mapred.reduce.tasks:1}
mapred.job.tracker=${hd.jt:local}
mapred.child.java.opts=${mapred.child.java.opts}
</configuration>
40. MapReduce Jobs
• Configured via Spring Hadoop
• One job per entity
TOOLS AND TECHNOLOGIES | 40
<job id="metricsMR"
input-path="${mapred.input.path}"
output-path="${mapred.output.path}"
mapper="GenericETLMapper"
reducer="GenericETLReducer”
input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
key="TextArrayWritable"
value="org.apache.hadoop.io.NullWritable"
map-key="org.apache.hadoop.io.Text"
map-value="org.apache.hadoop.io.Text"
jar-by-class="GenericETLMapper">
volga.etl.dto.class=Metric
</job>
41. MapReduce Jobs
• Jobs are wrapped into Tasklet definitions
TOOLS AND TECHNOLOGIES | 41
<job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
42. Hive Configuration
• Hive steps also defined as tasklets
• Parameters are passed from the MapReduce phase to the Hive phase
TOOLS AND TECHNOLOGIES | 42
<hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>
<hive-tasklet id="load-notifications">
<script location="classpath:hive/ddl/notifications-load.hql"/>
</hive-tasklet>
<hive-tasklet id="load-metrics">
<script location="classpath:hive/ddl/metrics-load.hql">
<arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
</script>
</hive-tasklet>
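The .hql scripts themselves are not shown; a minimal sketch of what metrics-load.hql might do with the INPUT_PATH argument, assuming the argument is exposed as a Hive variable (table and column names are hypothetical):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- move the MapReduce output into the staging table
LOAD DATA INPATH '${INPUT_PATH}' INTO TABLE metrics_staging;

-- then write it into the partitioned, ORC-backed production table
INSERT INTO TABLE metrics PARTITION (dt)
SELECT entity_id, check_id, value, dt FROM metrics_staging;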
44. Spring Batch Listeners
• Monitor job flow
• Take action on job failure
• PagerDuty notifications
• Save job counters to the audit database
• Notify team if counters are not consistent with historical audit data (based on thresholds)
TOOLS AND TECHNOLOGIES | 44
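As a sketch, such a listener hooks into Spring Batch's JobExecutionListener callback; the PagerDuty client and audit DAO below are hypothetical stand-ins:

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.StepExecution;

interface PagerDutyClient { void trigger(String message); }
interface JobAuditDao { void save(Iterable<StepExecution> steps); }

public class EtlJobListener implements JobExecutionListener {

    private final PagerDutyClient pagerDuty;
    private final JobAuditDao auditDao;

    public EtlJobListener(PagerDutyClient pagerDuty, JobAuditDao auditDao) {
        this.pagerDuty = pagerDuty;
        this.auditDao = auditDao;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // no-op: we only react once the job has finished
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        if (jobExecution.getStatus() == BatchStatus.FAILED) {
            // page the on-call engineer
            pagerDuty.trigger("ETL job failed: "
                    + jobExecution.getJobInstance().getJobName());
        }
        // persist counters so they can be checked against historical audit data
        auditDao.save(jobExecution.getStepExecutions());
    }
}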
45. Spring Boot: Pulling Everything Together
• Runnable jar created during build process
• Controlled by Maven plugin
TOOLS AND TECHNOLOGIES | 45
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<finalName>maas-etl-${project.version}</finalName>
<classifier>spring</classifier>
<mainClass>com.rackspace....JobRunner</mainClass>
<excludeGroupIds>org.slf4j</excludeGroupIds>
</configuration>
</plugin>
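The resulting artifact can then be launched directly, e.g. java -jar maas-etl-<version>-spring.jar (per the finalName and classifier configured above).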
46. HIVE
• Typical Use Cases
• File formats
• ORC
• Abstractions
• Hive in the monitoring pipeline
• Query performance
47. Overview
• Translates SQL commands into MR jobs.
• Structured and unstructured data in multiple formats
• Standard access protocols, including JDBC and Thrift
• Provides several serialization mechanisms
• Integrates seamlessly with Hadoop: HCatalog, Pig, HBase, etc.
HIVE | 47
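A minimal sketch of JDBC access against HiveServer2 (host, port, credentials, and query are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT count(*) FROM metrics");
        while (rs.next()) {
            System.out.println(rs.getLong(1));
        }
        con.close();
    }
}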
48. Hive vs. RDBMS
HIVE | 48
Hive                                                              | Traditional Databases
SQL interface                                                     | SQL interface
Focus on batch analytics                                          | Mostly online, interactive analytics
No transactions                                                   | Transactions are their way of life
No random inserts; updates not natively supported (but possible)  | Random inserts and updates
Distributed processing via MR                                     | Distributed processing capabilities vary
Scales to hundreds of nodes                                       | Seldom scales beyond 20 nodes
Built for commodity hardware                                      | Expensive, proprietary hardware
Low cost per petabyte                                             | What's a petabyte?
49. Abstraction Layers in Hive
HIVE | 49
[Diagram: a database contains tables; each table is divided into optional partitions, and partitions into optional buckets. Tables can also separate skewed keys from unskewed keys.]
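In DDL these layers map onto the PARTITIONED BY and CLUSTERED BY clauses; a hypothetical example:

CREATE TABLE metrics (
    entity_id STRING,
    check_id  STRING,
    value     DOUBLE
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (entity_id) INTO 32 BUCKETS
STORED AS ORC;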
50. Schemas and File Formats
• We used the ORCFile format: built-in, easy to use and efficient.
• Efficient lightweight and generic compression
• Run length encoding for integers and strings, dictionary encoding, etc.
• Generic compression: Snappy, LZO, and ZLib (default)
• High performance
• Indexes value ranges within blocks of ORCFile data
• Predicate filter pushdown allows efficient scanning during queries.
• Flexible Data Model
• All Hive types are supported, including maps, structs and unions.
HIVE | 50
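The generic codec is chosen per table through the orc.compress table property (ZLIB being the default); a minimal example with a hypothetical table:

CREATE TABLE checks_orc (payload STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY");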
51. The ORC File Format
• An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer.
• The default stripe size is 256 MB (orc.stripe.size).
• Large stripes allow efficient reads from HDFS, configured independently of the block size.
HIVE | 51
52. The ORC File Format: Index
• Doesn’t answer queries
• Required for skipping rows:
• Row index entries provide offsets that enable seeking
• Min and max values for each column
HIVE | 52
53. ORC File Index Skipping
HIVE | 53
Skipping works for both numeric and string types. It is done by recording min and max values inside the inline index and determining whether the lookup value falls outside that range.
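In Hive 0.12, this skipping is enabled through the predicate-pushdown switch; a small example (table and predicate are hypothetical):

SET hive.optimize.index.filter=true;
SELECT count(*) FROM metrics WHERE value > 100.0;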
54. The ORC File Format: File Footer
• List of stripes in the file, the number of rows per stripe, and each column's data type.
• Column-level aggregates: count, min, max, and sum.
• ORC uses the file footer to locate each column's data streams.
HIVE | 54
55. Predicate Pushdowns
• “Push down” parts of the query to where the data is.
• Filter and skip as much data as possible, greatly reducing input size.
• Sorting a table on its secondary keys also reduces execution time.
• Sorted columns are grouped together in one area on disk, and the other pieces can be skipped very quickly.
HIVE | 55
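One way to get that layout, sketched with hypothetical names, is to sort on the secondary key at load time so each stripe covers a narrow value range:

INSERT OVERWRITE TABLE metrics_sorted PARTITION (dt = '2014-06-01')
SELECT entity_id, check_id, value
FROM metrics_staging
SORT BY entity_id;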
57. Query Performance
• Lower latency Hive queries rely on two major factors:
• Sorting and skipping data as much as possible
• Minimizing data shuffle from mappers to reducers
HIVE | 57
58. Improving Query Performance
• Divide data among different files/directories
• Partitions, buckets, etc.
• Skip records using small embedded indexes.
• ORCFile format.
• Sort data ahead of time.
• Simplifies joins, making ORCFile skipping more effective.
HIVE | 58
59. The Big Picture
DATA ENGINEERING | 59
[Diagram: the flow starts with JSON data in HDFS; a MapReduce preprocessing stage writes Hive files back to HDFS; the data load stage performs a dynamic load with partitioning, bucketing, and indexing, moving data from a staging table to the production table; data access is via an API, the Hive CLI, and Apache Thrift.]