This presentation outlines the evolution of the monitoring data pipeline at Rackspace and explores the compute and data management challenges we have faced at this scale. We focus on our use of Hadoop and Hive as data storage and transformation platforms while discussing the technology stack, key architectural decisions, and the observations and pitfalls encountered in building the pipeline.
1. Faster Data Flows with Hive, Spring and Hadoop
Alex Silva
Principal Data Engineer
2. DATA ADVENTURES AT RACKSPACE
• Datasets
• Data pipeline: flows and systems
• Creating a generic Hadoop ETL framework
• Integrating Hadoop with Spring
• Spring Hadoop, Spring Batch and Spring Boot
• Hive
• File formats
• Queries and performance
3. MAAS Dataset
• System and platform monitoring
• Pings, SSH, HTTP, HTTPS checks
• Remote monitoring
• CPU, file system, load average, disk, memory
• MySQL, Apache
THE BUSINESS DOMAIN | 3
4. The Dataset
• Processing around 1.5B records/day
• Stored in Cassandra
• Exported to HDFS in batches
• TBs of uncompressed JSON (“raw data”) daily
• First dataset piped through ETL platform
DATA ENGINEERING STATS | 4
5. DATA PIPELINE
• Data flow
• Stages
• ETL
• Input formats
• Generic Transformation Layer
• Outputs
6. Data Flow Diagram
DATA FLOW | 6
[Flowchart: monitoring data is exported as JSON into HDFS. If the export is available and well-formed, the extract-and-transform ETL stage runs; otherwise the flow stops. Bad rows and errors are logged through Flume. Clean records are written to a CSV staging file in HDFS, and the load stage moves data from a staging table to the production Hive table, applying partitioning, bucketing, and indexing.]
7. Systems Diagram
SYSTEMS | 7
[Diagram: monitoring events are exported as JSON into HDFS; MapReduce (1.2.0.1.3.2.0) extracts and transforms them; results are loaded into Hive 0.12.0 for end-user access; bad records are routed to a sink through the Flume Log4j appender (Flume 1.5.0).]
8. ETL Summary
• Extract
• JSON files in HDFS
• Transform
• Generic Java-based ETL framework
• MapReduce jobs extract features
• Quality checks
• Load
• Load data into partitioned ORC Hive tables
DATA FLOW | 8
10. Hadoop: Pros
• Dataset volume
• Data volume grows at a very rapid rate
• Integrates with existing ecosystem
• HiveQL
• Experimentation and exploration
• No expensive software or hardware to buy
TOOLS AND TECHNOLOGIES | 10
11. Hadoop: Cons
• Job monitoring and scheduling
• Data quality
• Error handling and notification
• Programming model
• Our generic framework mitigates some of this
TOOLS AND TECHNOLOGIES | 11
13. Keeping the Elephant “Lean”
• Job control without the complexity of external tools
• Checks and validations
• Unified configuration model
• Integration with scripts
• Automation
• Job restartability
DATA ENGINEERING | 13
16. What is it about?
• Part of the Spring Framework
• Run Hadoop apps as standard Java apps using DI
• Unified declarative configuration model
• APIs to run MapReduce, Hive, and Pig jobs.
• Script HDFS operations using any JVM-based language.
• Supports both classic MR and YARN
TOOLS AND TECHNOLOGIES | 16
17. The Apache Hadoop Namespace
TOOLS AND TECHNOLOGIES | 17
Also supports annotation-based configuration via the @EnableHadoop annotation.
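For reference, the Spring for Apache Hadoop XML namespace is declared like this:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
    <!-- hdp:* elements such as hdp:configuration and hdp:job go here -->
</beans>

And a minimal sketch of the annotation-based alternative (class name and file system URI are illustrative):

@Configuration
@EnableHadoop
public class HadoopConfig extends SpringHadoopConfigurerAdapter {
    @Override
    public void configure(HadoopConfigConfigurer config) throws Exception {
        // equivalent of setting fs.default.name in the XML configuration
        config.fileSystemUri("hdfs://localhost:9000");
    }
}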
18. Job Configuration: Standard Hadoop APIs
TOOLS AND TECHNOLOGIES | 18
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
19. Configuring Hadoop with Spring
SPRING HADOOP | 19
<context:property-placeholder location="hadoop-dev.properties"/>
<hdp:configuration>
fs.default.name=${hd.fs}
</hdp:configuration>
<hdp:job id="word-count-job"
    input-path="${input.path}"
    output-path="${output.path}"
    jar="hadoop-examples.jar"
    mapper="examples.WordCount.WordMapper"
    reducer="examples.WordCount.IntSumReducer"/>
<hdp:job-runner id="runner" job-ref="word-count-job"
    run-at-startup="true"/>

hadoop-dev.properties:
input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000
22. Injecting Jobs
• Use DI to obtain a reference to a Spring-managed Hadoop job
• Perform additional validation and configuration before submitting
TOOLS AND TECHNOLOGIES | 22
public class WordService {

  @Autowired
  private Job mapReduceJob;

  public void processWords() {
    mapReduceJob.submit();
  }
}
27. Scripting Support in HDFS
• FSShell is designed to support scripting languages
• Use these for housekeeping tasks:
• Check for files, prepare input data, clean output
directories, set flags, etc.
TOOLS AND TECHNOLOGIES | 27
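A minimal housekeeping sketch using Spring Hadoop's FsShell from Java (the paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.springframework.data.hadoop.fs.FsShell;

public class Housekeeping {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FsShell fsh = new FsShell(conf);

        // clean the output directory so the job can be re-run
        if (fsh.test("/etl/output")) {
            fsh.rmr("/etl/output");
        }
        // make sure the input directory exists before submitting
        if (!fsh.test("/etl/input")) {
            fsh.mkdir("/etl/input");
        }
    }
}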
29. What is it about?
• Born out of collaboration with Accenture in 2007
• Fully automated processing of large volumes of data.
• Logging, transaction management, listeners, job statistics, restart, skipping, and resource management
• Automatic retries after failure
• Sync, async and parallel processing
• Data partitioning
TOOLS AND TECHNOLOGIES | 29
30. Hadoop Workflow Orchestration
• Complex data flows
• Reuses batch infrastructure to manage Hadoop workflows.
• Steps can be any Hadoop job type or HDFS script
• Jobs can be invoked by events or scheduled.
• Steps can be sequential, conditional, split, concurrent, or programmatically determined.
• Works with flat files, XML, or databases.
TOOLS AND TECHNOLOGIES | 30
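As a hedged sketch, such a workflow is simply a Spring Batch job whose steps wrap the Hadoop tasklets defined later in this deck (step names are illustrative):

<batch:job id="etlJob">
    <batch:step id="extractTransform" next="loadHive">
        <batch:tasklet ref="metricsJobTasklet"/>
    </batch:step>
    <batch:step id="loadHive">
        <batch:tasklet ref="load-metrics"/>
    </batch:step>
</batch:job>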
34. What is it about?
• Builds production-ready Spring applications.
• Creates a “runnable” jar with dependencies and classpath settings.
• Can embed Tomcat or Jetty within the JAR
• Automatic configuration
• Out of the box features:
• statistics, metrics, health checks and externalized configuration
• No code generation and no requirement for XML configuration.
TOOLS AND TECHNOLOGIES | 34
36. Spring Data Flow Components
TOOLS AND TECHNOLOGIES | 36
[Diagram: Spring Boot packages the application; Spring Batch 2.0 drives the extract and load steps; Spring Hadoop 2.01.1.5 connects them to HDFS, Hive 0.12.0, and MapReduce on HDP 1.3.]
37. Hierarchical View
TOOLS AND TECHNOLOGIES | 37
[Diagram: a layered stack with Spring Boot on top, Spring Batch beneath it providing job control (notifications, validation, scheduling, data flow, callbacks), and Spring Hadoop at the base.]
39. Spring Hadoop Configuration
• Job parameters configured by Spring
• Sensible defaults used
• Parameters can be overridden:
• External properties file.
• At runtime via system properties: -Dproperty.name=property.value
TOOLS AND TECHNOLOGIES | 39
<configuration>
fs.default.name=${hd.fs}
io.sort.mb=${io.sort.mb:640}
mapred.reduce.tasks=${mapred.reduce.tasks:1}
mapred.job.tracker=${hd.jt:local}
mapred.child.java.opts=${mapred.child.java.opts}
</configuration>
40. MapReduce Jobs
• Configured via Spring Hadoop
• One job per entity
TOOLS AND TECHNOLOGIES | 40
<job id="metricsMR"
input-path="${mapred.input.path}"
output-path="${mapred.output.path}"
mapper="GenericETLMapper"
reducer="GenericETLReducer”
input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
key="TextArrayWritable"
value="org.apache.hadoop.io.NullWritable"
map-key="org.apache.hadoop.io.Text"
map-value="org.apache.hadoop.io.Text"
jar-by-class="GenericETLMapper">
volga.etl.dto.class=Metric
</job>
41. MapReduce Jobs
• Jobs are wrapped into Tasklet definitions
TOOLS AND TECHNOLOGIES | 41
<job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
42. Hive Configuration
• Hive steps also defined as tasklets
• Parameters are passed from the MapReduce phase to the Hive phase
TOOLS AND TECHNOLOGIES | 42
<hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>
<hive-tasklet id="load-notifications">
<script location="classpath:hive/ddl/notifications-load.hql"/>
</hive-tasklet>
<hive-tasklet id="load-metrics">
<script location="classpath:hive/ddl/metrics-load.hql">
<arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
</script>
</hive-tasklet>
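The .hql scripts themselves are not shown; a minimal sketch of what metrics-load.hql might do with the INPUT_PATH argument, assuming the argument is exposed as a Hive variable (table and column names are hypothetical):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- move the MapReduce output into the staging table
LOAD DATA INPATH '${INPUT_PATH}' INTO TABLE metrics_staging;

-- then write it into the partitioned, ORC-backed production table
INSERT INTO TABLE metrics PARTITION (dt)
SELECT entity_id, check_id, value, dt FROM metrics_staging;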
44. Spring Batch Listeners
• Monitor job flow
• Take action on job failure
• PagerDuty notifications
• Save job counters to the audit database
• Notify team if counters are not consistent with historical audit data (based on thresholds)
TOOLS AND TECHNOLOGIES | 44
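As a sketch, such a listener hooks into Spring Batch's JobExecutionListener callback; the PagerDuty client and audit DAO below are hypothetical stand-ins:

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.StepExecution;

interface PagerDutyClient { void trigger(String message); }
interface JobAuditDao { void save(Iterable<StepExecution> steps); }

public class EtlJobListener implements JobExecutionListener {

    private final PagerDutyClient pagerDuty;
    private final JobAuditDao auditDao;

    public EtlJobListener(PagerDutyClient pagerDuty, JobAuditDao auditDao) {
        this.pagerDuty = pagerDuty;
        this.auditDao = auditDao;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // no-op: we only react once the job has finished
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        if (jobExecution.getStatus() == BatchStatus.FAILED) {
            // page the on-call engineer
            pagerDuty.trigger("ETL job failed: "
                    + jobExecution.getJobInstance().getJobName());
        }
        // persist counters so they can be checked against historical audit data
        auditDao.save(jobExecution.getStepExecutions());
    }
}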
45. Spring Boot: Pulling Everything Together
• Runnable jar created during build process
• Controlled by Maven plugin
TOOLS AND TECHNOLOGIES | 45
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<finalName>maas-etl-${project.version}</finalName>
<classifier>spring</classifier>
<mainClass>com.rackspace....JobRunner</mainClass>
<excludeGroupIds>org.slf4j</excludeGroupIds>
</configuration>
</plugin>
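The resulting artifact can then be launched directly, e.g. java -jar maas-etl-<version>-spring.jar (per the finalName and classifier configured above).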
46. HIVE
• Typical Use Cases
• File formats
• ORC
• Abstractions
• Hive in the monitoring pipeline
• Query performance
47. Overview
• Translates SQL commands into MR jobs.
• Structured and unstructured data in multiple formats
• Standard access protocols, including JDBC and Thrift
• Provides several serialization mechanisms
• Integrates seamlessly with Hadoop: HCatalog, Pig, HBase, etc.
HIVE | 47
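A minimal sketch of JDBC access against HiveServer2 (host, port, credentials, and query are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT count(*) FROM metrics");
        while (rs.next()) {
            System.out.println(rs.getLong(1));
        }
        con.close();
    }
}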
48. Hive vs. RDBMS
HIVE | 48
Hive                                                              | Traditional Databases
SQL interface                                                     | SQL interface
Focus on batch analytics                                          | Mostly online, interactive analytics
No transactions                                                   | Transactions are their way of life
No random inserts; updates not natively supported (but possible)  | Random inserts and updates
Distributed processing via MR                                     | Distributed processing capabilities vary
Scales to hundreds of nodes                                       | Seldom scales beyond 20 nodes
Built for commodity hardware                                      | Expensive, proprietary hardware
Low cost per petabyte                                             | What's a petabyte?
49. Abstraction Layers in Hive
HIVE | 49
[Diagram: a database contains tables; each table is divided into optional partitions, and partitions into optional buckets. Tables can also separate skewed keys from unskewed keys.]
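In DDL these layers map onto the PARTITIONED BY and CLUSTERED BY clauses; a hypothetical example:

CREATE TABLE metrics (
    entity_id STRING,
    check_id  STRING,
    value     DOUBLE
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (entity_id) INTO 32 BUCKETS
STORED AS ORC;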
50. Schemas and File Formats
• We used the ORCFile format: built-in, easy to use and efficient.
• Efficient lightweight and generic compression
• Run length encoding for integers and strings, dictionary encoding, etc.
• Generic compression: Snappy, LZO, and ZLib (default)
• High performance
• Indexes value ranges within blocks of ORCFile data
• Predicate filter pushdown allows efficient scanning during queries.
• Flexible Data Model
• All Hive types are supported, including maps, structs and unions.
HIVE | 50
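The generic codec is chosen per table through the orc.compress table property (ZLIB being the default); a minimal example with a hypothetical table:

CREATE TABLE checks_orc (payload STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY");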
51. The ORC File Format
• An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer.
• The default stripe size is 256 MB (orc.stripe.size).
• Large stripes allow efficient reads from HDFS, configured independently of the block size.
HIVE | 51
52. The ORC File Format: Index
• Doesn’t answer queries
• Required for skipping rows:
• Row index entries provide offsets that enable seeking
• Min and max values for each column
HIVE | 52
53. ORC File Index Skipping
HIVE | 53
Skipping works for both numeric and string types. It is done by recording min and max values inside the inline index and determining whether the lookup value falls outside that range.
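In Hive 0.12, this skipping is enabled through the predicate-pushdown switch; a small example (table and predicate are hypothetical):

SET hive.optimize.index.filter=true;
SELECT count(*) FROM metrics WHERE value > 100.0;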
54. The ORC File Format: File Footer
• List of stripes in the file, the number of rows per stripe, and each column's data type.
• Column-level aggregates: count, min, max, and sum.
• ORC uses the file footer to locate each column's data streams.
HIVE | 54
55. Predicate Pushdowns
• “Push down” parts of the query to where the data is.
• Filter and skip as much data as possible, greatly reducing input size.
• Sorting a table on its secondary keys also reduces execution time.
• Sorted columns are grouped together in one area on disk, and the other pieces can be skipped very quickly.
HIVE | 55
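One way to get that layout, sketched with hypothetical names, is to sort on the secondary key at load time so each stripe covers a narrow value range:

INSERT OVERWRITE TABLE metrics_sorted PARTITION (dt = '2014-06-01')
SELECT entity_id, check_id, value
FROM metrics_staging
SORT BY entity_id;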
57. Query Performance
• Lower latency Hive queries rely on two major factors:
• Sorting and skipping data as much as possible
• Minimizing data shuffle from mappers to reducers
HIVE | 57
58. Improving Query Performance
• Divide data among different files/directories
• Partitions, buckets, etc.
• Skip records using small embedded indexes.
• ORCFile format.
• Sort data ahead of time.
• Simplifies joins, making ORCFile skipping more effective.
HIVE | 58
59. The Big Picture
DATA ENGINEERING | 59
[Diagram: the flow starts with JSON data in HDFS; a MapReduce preprocessing stage writes Hive files back to HDFS; the data load stage performs a dynamic load with partitioning, bucketing, and indexing, moving data from a staging table to the production table; data access is via an API, the Hive CLI, and Apache Thrift.]