Faster Data Flows with Hive, Spring and Hadoop
Alex Silva

Principal Data Engineer
DATA ADVENTURES AT RACKSPACE
• Datasets
• Data pipeline: flows and systems
• Creating a generic Hadoop ETL framework
• Integrating Hadoop with Spring
• Spring Hadoop, Spring Batch and Spring Boot
• Hive
• File formats
• Queries and performance
MAAS Dataset
• System and platform monitoring
• Pings, SSH, HTTP, HTTPS checks
• Remote monitoring
• CPU, file system, load average, disk, memory
• MySQL, Apache
THE BUSINESS DOMAIN | 3
The Dataset
• Processing around 1.5B records/day
• Stored in Cassandra
• Exported to HDFS in batches
• TBs of uncompressed JSON (“raw data”) daily
• First dataset piped through ETL platform
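For a sense of scale, the daily volume above can be converted into a sustained per-second rate. The sketch below is only a back-of-the-envelope calculation; the class and method names are illustrative, and the input is the approximate 1.5B figure quoted above.

```java
// Back-of-the-envelope conversion of a daily record count into an average rate.
final class IngestRate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average records per second for a given daily volume.
    static double recordsPerSecond(long recordsPerDay) {
        return (double) recordsPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long perDay = 1_500_000_000L; // "around 1.5B records/day" from the slide
        System.out.printf("%.0f records/second on average%n", recordsPerSecond(perDay));
    }
}
```

That works out to roughly 17,000 records per second on average, before accounting for any daily peaks.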
DATA ENGINEERING STATS | 4
DATA PIPELINE
• Data flow
• Stages
• ETL
• Input formats
• Generic Transformation Layer
• Outputs
Data Flow Diagram
DATA FLOW | 6
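[Diagram: Monitoring data is exported as JSON to HDFS. Start: if the JSON data is not available and well-formed, stop. Otherwise the ETL step extracts and transforms it; bad rows and errors are logged through Flume, and good rows are written to a CSV staging file. The load step moves data from a staging table to a production Hive table, with partitioning, bucketing, and indexing.]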
Systems Diagram
SYSTEMS | 7
[Diagram: Monitoring events are exported to HDFS as JSON. Extract: MapReduce 1.2.0.1.3.2.0. Load: Hive 0.12.0. Bad records sink: Flume 1.5.0, fed by the Flume Log4J appender. Access: end user.]
ETL Summary
• Extract
• JSON files in HDFS
• Transform
• Generic Java based ETL framework
• MapReduce jobs extract features
• Quality checks
• Load
• Load data into partitioned ORC Hive tables
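The transform stage summarized above can be sketched roughly as follows. This is not the framework from the talk: `RowTransformer`, its field list, and the CSV quoting rules are hypothetical, and the record is assumed to have been parsed from JSON into a map already. The point is only the shape of the quality check: good records become CSV lines, bad records are set aside for logging rather than failing the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a transform step with a simple quality check.
final class RowTransformer {
    private final List<String> requiredFields;
    private final List<String> badRows = new ArrayList<>();

    RowTransformer(List<String> requiredFields) {
        this.requiredFields = requiredFields;
    }

    // Returns a CSV line, or empty if the record fails the quality check.
    Optional<String> transform(Map<String, String> record) {
        List<String> values = new ArrayList<>();
        for (String field : requiredFields) {
            String value = record.get(field);
            if (value == null || value.isEmpty()) {
                badRows.add("missing field '" + field + "' in " + record);
                return Optional.empty();
            }
            values.add(escape(value));
        }
        return Optional.of(String.join(",", values));
    }

    // Quote values containing commas, quotes, or newlines (doubling embedded quotes).
    private static String escape(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Rejected records, kept so they can be sent to a bad-records sink.
    List<String> badRows() {
        return badRows;
    }
}
```

In a MapReduce job, logic like this would sit inside the mapper, with the rejected rows routed to a logging sink instead of an in-memory list.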
DATA FLOW | 8
HADOOP
Hadoop: Pros
• Dataset volume
• Data grows exponentially
• Integrates with existing ecosystem
• HiveQL
• Experimentation and exploration
• No expensive software or hardware to buy
TOOLS AND TECHNOLOGIES | 10
Hadoop: Cons
• Job monitoring and scheduling
• Data quality
• Error handling and notification
• Programming model
• Generic framework mitigates some of that
TOOLS AND TECHNOLOGIES | 11
CAN WE OVERCOME SOME OF THOSE?
Keeping the Elephant “Lean”
• Job control without the complexity of external tools
• Checks and validations
• Unified configuration model
• Integration with scripts
• Automation
• Job restartability
DATA ENGINEERING | 13
HEY! WHAT ABOUT SPRING?
SPRING DATA HADOOP
What is it about?
• Part of the Spring portfolio (Spring for Apache Hadoop)
• Run Hadoop apps as standard Java apps using DI
• Unified declarative configuration model
• APIs to run MapReduce, Hive, and Pig jobs.
• Script HDFS operations using any JVM-based language.
• Supports both classic MR and YARN
TOOLS AND TECHNOLOGIES | 16
The Apache Hadoop Namespace
TOOLS AND TECHNOLOGIES | 17
Also supports annotation based configuration via the
@EnableHadoop annotation.
Job Configuration: Standard Hadoop APIs
TOOLS AND TECHNOLOGIES | 18
// Configure and run the word-count job with the plain Hadoop API.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Blocks until the job finishes; true prints progress to the console.
job.waitForCompletion(true);
Configuring Hadoop with Spring
SPRING HADOOP | 19
<context:property-placeholder location="hadoop-dev.properties"/>
<hdp:configuration>
fs.default.name=${hd.fs}
</hdp:configuration>
<hdp:job id="word-count-job"
input-path="${input.path}"
output-path="${output.path}"
jar="hadoop-examples.jar"
mapper="examples.WordCount.WordMapper"
reducer="examples.WordCount.IntSumReducer"/>
<hdp:job-runner id="runner" job-ref="word-count-job"
run-at-startup="true"/>
# hadoop-dev.properties
input.path=/wc/input/-
output.path=/wc/word/-
hd.fs=hdfs://localhost:9000
SPRING HADOOP | 20
Configuration Attributes
Creating a Job
SPRING HADOOP | 21
Injecting Jobs
• Use DI to obtain reference to Spring managed Hadoop
job
• Perform additional validation and configuration before
submitting
TOOLS AND TECHNOLOGIES | 22
public class WordService {

  @Autowired
  private Job mapReduceJob;

  // Job.submit() declares checked exceptions, so they are propagated here.
  public void processWords() throws Exception {
    mapReduceJob.submit();
  }
}
Running a Job
TOOLS AND TECHNOLOGIES | 23
Distributed Cache
TOOLS AND TECHNOLOGIES | 24
Using Scripts
TOOLS AND TECHNOLOGIES | 25
Scripting Implicit Variables
TOOLS AND TECHNOLOGIES | 26
Scripting Support in HDFS
• FSShell is designed to support scripting languages
• Use these for housekeeping tasks:
• Check for files, prepare input data, clean output
directories, set flags, etc.
TOOLS AND TECHNOLOGIES | 27
SPRING BATCH
What is it about?
• Born out of collaboration with Accenture in 2007
• Fully automated processing of large volumes of data.
• Logging, txn management, listeners, job statistics,
restart, skipping, and resource management.
• Automatic retries after failure
• Synch, async and parallel processing
• Data partitioning
TOOLS AND TECHNOLOGIES | 29
Hadoop Workflow Orchestration
• Complex data flows
• Reuses batch infrastructure to manage Hadoop workflows.
• Steps can be any Hadoop job type or HDFS script
• Jobs can be invoked by events or scheduled.
• Steps can be sequential, conditional, split, concurrent, or programmatically determined.
• Works with flat files, XML, or databases.
TOOLS AND TECHNOLOGIES | 30
Spring Batch Configuration
• Jobs are composed of steps
TOOLS AND TECHNOLOGIES | 31
<job id="job1">
  <step id="import" next="wordcount">
    <tasklet ref="import-tasklet"/>
  </step>
  <step id="wordcount" next="pig">
    <tasklet ref="wordcount-tasklet"/>
  </step>
  <step id="pig" next="parallel">
    <tasklet ref="pig-tasklet"/>
  </step>
  <split id="parallel" next="hdfs">
    <flow><step id="mrStep">
      <tasklet ref="mr-tasklet"/>
    </step></flow>
    <flow><step id="hive">
      <tasklet ref="hive-tasklet"/>
    </step></flow>
  </split>
  <step id="hdfs">
    <tasklet ref="hdfs-tasklet"/>
  </step>
</job>
Spring Data Hadoop Integration
TOOLS AND TECHNOLOGIES | 32
SPRING BOOT
What is it about?
• Builds production-ready Spring applications.
• Creates a “runnable” jar with dependencies and classpath settings.
• Can embed Tomcat or Jetty within the JAR
• Automatic configuration
• Out of the box features:
• statistics, metrics, health checks and externalized configuration
• No code generation and no requirement for XML configuration.
TOOLS AND TECHNOLOGIES | 34
PUTTING IT ALL TOGETHER
Spring Data Flow Components
TOOLS AND TECHNOLOGIES | 36
Spring Boot
Extract
Spring Batch
2.0
Load
Spring Hadoop
2.01.1.5
HDFS
Hive
0.12.0
MapReduce
HDP 1.3
Hierarchical View
TOOLS AND TECHNOLOGIES | 37
Spring Boot
Spring Batch
Job control
Spring Hadoop
- Notifications
- Validation
- Scheduling
- Data Flow
- Callbacks
HADOOP DATA FLOWS, SPRINGIFIED
Spring Hadoop Configuration
• Job parameters configured by Spring
• Sensible defaults used
• Parameters can be overridden:
• External properties file.
• At runtime via system properties: -Dproperty.name=property.value
TOOLS AND TECHNOLOGIES | 39
<configuration>
fs.default.name=${hd.fs}
io.sort.mb=${io.sort.mb:640}
mapred.reduce.tasks=${mapred.reduce.tasks:1}
mapred.job.tracker=${hd.jt:local}
mapred.child.java.opts=${mapred.child.java.opts}
</configuration>
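The `${name:default}` placeholders above resolve to an externally supplied value when one exists and fall back to the inline default otherwise. The sketch below mimics that behavior with the standard library only; it is a simplified stand-in, not Spring's implementation, and the lookup order (system property, then properties file, then inline default) is an assumption made for illustration.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for placeholder resolution, for illustration only.
final class PlaceholderResolver {
    // Matches ${name} and ${name:default}.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    // Assumed lookup order: system property, then properties file, then inline default.
    static String resolve(String text, Properties fileProps) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String value = System.getProperty(name);
            if (value == null) value = fileProps.getProperty(name);
            if (value == null) value = m.group(2); // inline default, may be absent
            if (value == null) {
                throw new IllegalArgumentException("Unresolved placeholder: " + name);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, starting the JVM with `-Dmapred.reduce.tasks=8` would turn `${mapred.reduce.tasks:1}` into `8`, while leaving the property unset yields the default `1`.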
MapReduce Jobs
• Configured via Spring Hadoop
• One job per entity
TOOLS AND TECHNOLOGIES | 40
<job id="metricsMR"
input-path="${mapred.input.path}"
output-path="${mapred.output.path}"
mapper="GenericETLMapper"
reducer="GenericETLReducer"
input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
key="TextArrayWritable"
value="org.apache.hadoop.io.NullWritable"
map-key="org.apache.hadoop.io.Text"
map-value="org.apache.hadoop.io.Text"
jar-by-class="GenericETLMapper">
volga.etl.dto.class=Metric
</job>
MapReduce Jobs
• Jobs are wrapped into Tasklet definitions
TOOLS AND TECHNOLOGIES | 41
<job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
Hive Configuration
• Hive steps also defined as tasklets
• Parameters are passed from MapReduce phase to Hive
phase
TOOLS AND TECHNOLOGIES | 42
<hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>
<hive-tasklet id="load-notifications">
<script location="classpath:hive/ddl/notifications-load.hql"/>
</hive-tasklet>
<hive-tasklet id="load-metrics">
<script location="classpath:hive/ddl/metrics-load.hql">
<arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
</script>
</hive-tasklet>
Spring Batch Configuration
• One Spring Batch job per entity.
TOOLS AND TECHNOLOGIES | 43
<job id="metrics" restartable="false" parent="VolgaETLJob">
<step id="cleanMetricsOutputDirectory" next="metricsMapReduce">
<tasklet ref="setUpJobTasklet"/>
</step>
<step id="metricsMapReduce">
<tasklet ref="metricsJobTasklet">
<listeners>
<listener ref="mapReduceErrorThresholdListener"/>
</listeners>
</tasklet>
<fail on="FAILED" exit-code="Map Reduce Step Failed"/>
<end on="COMPLETED"/>
<!--<next on="*" to="loadMetricsIntoHive"/>-->
</step>
<step id="loadMetricsIntoHive">
<tasklet ref="load-notifications"/>
</step>
</job>
Spring Batch Listeners
• Monitor job flow
• Take action on job failure
• PagerDuty notifications
• Save job counters to the audit database
• Notify team if counters are not consistent with historical
audit data (based on thresholds)
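The threshold checks described above might look something like the following. The talk does not show the listener code, so `CounterThresholds`, its method names, and the comparison rules (a maximum bad-record fraction, and a tolerance around the historical mean) are assumptions chosen to illustrate the idea.

```java
import java.util.List;

// Hypothetical threshold checks of the kind a job listener could run after a step.
final class CounterThresholds {
    // True when bad records exceed the allowed fraction of all records read.
    static boolean errorRateExceeded(long badRecords, long totalRecords, double maxErrorRate) {
        if (totalRecords == 0) return badRecords > 0;
        return (double) badRecords / totalRecords > maxErrorRate;
    }

    // True when the current counter deviates from the historical mean by more
    // than the given fraction (0.2 means plus or minus 20%).
    static boolean deviatesFromHistory(long current, List<Long> history, double tolerance) {
        if (history.isEmpty()) return false; // nothing to compare against yet
        double mean = history.stream().mapToLong(Long::longValue).average().orElse(0.0);
        return Math.abs(current - mean) > tolerance * mean;
    }
}
```

A listener could call the first check to fail the step and the second to trigger a notification while still letting the job complete.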
TOOLS AND TECHNOLOGIES | 44
Spring Boot: Pulling Everything Together
• Runnable jar created during build process
• Controlled by Maven plugin
TOOLS AND TECHNOLOGIES | 45
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<finalName>maas-etl-${project.version}</finalName>
<classifier>spring</classifier>
<mainClass>com.rackspace....JobRunner</mainClass>
<excludeGroupIds>org.slf4j</excludeGroupIds>
</configuration>
</plugin>
HIVE
• Typical Use Cases
• File formats
• ORC
• Abstractions
• Hive in the monitoring pipeline
• Query performance
Overview
• Translates SQL commands into MR jobs.
• Structured and unstructured data in multiple formats
• Standard access protocols, including JDBC and Thrift
• Provides several serialization mechanisms
• Integrates seamlessly with Hadoop: HCatalog, Pig,
HBase, etc.
HIVE | 47
Hive vs. RDBMS
HIVE | 48
Hive | Traditional Databases
SQL Interface | SQL Interface
Focus on batch analytics | Mostly online, interactive analytics
No transactions | Transactions are their way of life
No random inserts; updates are not natively supported (but possible) | Random inserts and updates
Distributed processing via MR | Distributed processing capabilities vary
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Expensive, proprietary hardware
Low cost per petabyte | What's a petabyte?
Abstraction Layers in Hive
HIVE | 49
[Diagram: Database > Table > Partition > Bucket, with tables split into skewed and unskewed keys; some layers are marked optional.]
Schemas and File Formats
• We used the ORCFile format: built-in, easy to use and efficient.
• Efficient light-weight + generic compression
• Run length encoding for integers and strings, dictionary encoding, etc.
• Generic compression: Snappy, LZO, and ZLib (default)
• High performance
• Indexes value ranges within blocks of ORCFile data
• Predicate filter pushdown allows efficient scanning during queries.
• Flexible Data Model
• Hive types are supported including maps, structs and unions.
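To make the light-weight compression idea concrete, here is a minimal run-length encoder. It is a deliberately simplified illustration: ORC's actual integer encodings are more elaborate, and the `RunLength` class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified run-length encoding; not ORC's actual on-disk format.
final class RunLength {
    // Encodes values as {value, runLength} pairs.
    static List<long[]> encode(long[] values) {
        List<long[]> runs = new ArrayList<>();
        for (long v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++; // extend the current run
            } else {
                runs.add(new long[] {v, 1});    // start a new run
            }
        }
        return runs;
    }

    // Expands {value, runLength} pairs back into the original sequence.
    static long[] decode(List<long[]> runs) {
        int n = 0;
        for (long[] run : runs) n += (int) run[1];
        long[] out = new long[n];
        int i = 0;
        for (long[] run : runs) {
            for (long k = 0; k < run[1]; k++) out[i++] = run[0];
        }
        return out;
    }
}
```

Columns with long stretches of repeated values, which sorted or low-cardinality data tends to produce, shrink the most under this kind of encoding.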
HIVE | 50
The ORC File Format
• An ORC file contains groups of row data called stripes,
along with auxiliary information in a file footer.
• Default size is 256 MB (orc.stripe.size).
• Large stripes allow for efficient reads from HDFS; the stripe size is
configured independently of the HDFS block size.
HIVE | 51
The ORC File Format: Index
• Doesn’t answer queries
• Required for skipping rows:
• Row index entries provide offsets that enable seeking
• Min and max values for each column
HIVE | 52
ORC File Index Skipping
HIVE | 53
Skipping works for number types and for string types.
Done by recording a min and max value inside the inline index
and determining if the lookup value falls outside that range.
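The skipping logic described above can be sketched as follows. This is an illustration of the min/max idea only, not ORC's reader code; `MinMaxIndex`, the group size, and the equality-only predicate are assumptions, and `groupSize` is expected to be positive.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative min/max index: one entry per group of rows, checked before reading the rows.
final class MinMaxIndex {
    static final class Entry {
        final long min;
        final long max;

        Entry(long min, long max) {
            this.min = min;
            this.max = max;
        }

        // A group can be skipped when the lookup value lies outside [min, max].
        boolean canSkip(long lookup) {
            return lookup < min || lookup > max;
        }
    }

    // Builds one index entry per group of groupSize consecutive values.
    static List<Entry> build(long[] column, int groupSize) {
        List<Entry> index = new ArrayList<>();
        for (int start = 0; start < column.length; start += groupSize) {
            int end = Math.min(start + groupSize, column.length);
            long min = column[start];
            long max = column[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, column[i]);
                max = Math.max(max, column[i]);
            }
            index.add(new Entry(min, max));
        }
        return index;
    }

    // Number of groups that still have to be read for "column = lookup".
    static int groupsToRead(List<Entry> index, long lookup) {
        int count = 0;
        for (Entry e : index) {
            if (!e.canSkip(lookup)) count++;
        }
        return count;
    }
}
```

Note that the index never answers the query by itself: a group whose range contains the lookup value still has to be read and filtered.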
The ORC File Format: File Footer
• List of stripes in the file, the number of rows per stripe,
each column's data type.
• Column-level aggregates: count, min, max, and sum.
• ORC uses the file footer to locate each column's data streams.
HIVE | 54
Predicate Pushdowns
• “Push down” parts of the query to where the data is.
• filter/skip as much data as possible, and
• greatly reduce input size.
• Sorting a table on its secondary keys also reduces
execution time.
• Sorted columns are grouped together in one area on
disk and the other pieces will be skipped very quickly.
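A small experiment shows why sorting helps. The sketch below counts how many groups of rows have a min/max range containing a lookup value, first for unsorted data and then for the same data sorted. The class name, sample values, and group size are made up for illustration.

```java
import java.util.Arrays;

// Sorted data gives narrow, non-overlapping value ranges per group,
// so fewer groups can contain any given lookup value.
final class SortedSkipping {
    // Counts groups of groupSize rows whose [min, max] range contains lookup.
    static int groupsToRead(long[] column, int groupSize, long lookup) {
        int count = 0;
        for (int start = 0; start < column.length; start += groupSize) {
            int end = Math.min(start + groupSize, column.length);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, column[i]);
                max = Math.max(max, column[i]);
            }
            if (lookup >= min && lookup <= max) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] unsorted = {9, 1, 8, 2, 7, 3, 6, 4, 5, 10, 0, 11};
        long[] sorted = unsorted.clone();
        Arrays.sort(sorted);
        System.out.println("unsorted: " + groupsToRead(unsorted, 3, 5) + " groups to read");
        System.out.println("sorted:   " + groupsToRead(sorted, 3, 5) + " groups to read");
    }
}
```

With this sample data, the unsorted layout has to read all four groups to find the value 5, while the sorted layout reads only one.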
HIVE | 55
ORC File
HIVE | 56
Query Performance
• Lower latency Hive queries rely on two major factors:
• Sorting and skipping data as much as possible
• Minimizing data shuffle from mappers to reducers
HIVE | 57
Improving Query Performance
• Divide data among different files/directories
• Partitions, buckets, etc.
• Skip records using small embedded indexes.
• ORCFile format.
• Sort data ahead of time.
• Simplifies joins and makes ORCFile skipping more
effective.
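Dividing data among directories and files can be pictured with two small helpers. Hive stores each partition in a `key=value` subdirectory of the table directory; bucketing assigns rows to a fixed number of files by hashing a column. In this sketch the table path and column names are hypothetical, and `String.hashCode()` is only a stand-in for Hive's own hash function.

```java
// Illustrative layout helpers; the hash below is a stand-in, not Hive's own hash.
final class TableLayout {
    // Each partition lives in a key=value subdirectory of the table directory.
    static String partitionPath(String tableDir, String key, String value) {
        return tableDir + "/" + key + "=" + value;
    }

    // Assigns a row to one of numBuckets buckets from the hash of its bucketing column.
    static int bucketFor(String bucketingValue, int numBuckets) {
        // Mask the sign bit so the result is never negative.
        return (bucketingValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }
}
```

A query that filters on the partition key only needs to read the matching directory, and a lookup on the bucketing column only needs the one bucket file its hash maps to.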
HIVE | 58
The Big Picture
DATA ENGINEERING | 59
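[Diagram, three stages. Data preprocessing: start with JSON in HDFS; MapReduce writes a Hive file back to HDFS. Data load: the Hive file in HDFS is loaded into a staging table, then dynamically loaded into the prod table with partitioning, bucketing, and indexing. Data access: API, Hive CLI, Apache Thrift.]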
THANK YOU!
Get in touch:
alexvsilva@gmail.com
@thealexsilva

From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 

Recently uploaded (20)

Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 

Data Engineering with Spring, Hadoop and Hive

  • 1. Faster Data Flows with Hive, Spring and Hadoop Alex Silva Principal Data Engineer
  • 2. DATA ADVENTURES AT RACKSPACE • Datasets • Data pipeline: flows and systems • Creating a generic Hadoop ETL framework • Integrating Hadoop with Spring • Spring Hadoop, Spring Batch and Spring Boot • Hive • File formats • Queries and performance
  • 3. MAAS Dataset • System and platform monitoring • Pings, SSH, HTTP, HTTPS checks • Remote monitoring • CPU, file system, load average, disk memory • MySQL, Apache THE BUSINESS DOMAIN | 3
  • 4. The Dataset • Processing around 1.5B records/day • Stored in Cassandra • Exported to HDFS in batches • TBs of uncompressed JSON (“raw data”) daily • First dataset piped through ETL platform DATA ENGINEERING STATS | 4
  • 5. DATA PIPELINE • Data flow • Stages • ETL • Input formats • Generic Transformation Layer • Outputs
  • 6. Data Flow Diagram DATA FLOW | 6 Monitoring JSON Export HDFS Start Available and well-formed? No Stop EXTRACT AND TRANSFORM BAD ROW OR ERROR? LOG CSV STAGING FILE ETL JSON DATA HDFS Yes Yes No LOAD Partitioning Bucketing Indexing Staging Table Production Table ETL Hive Table Flume
  • 7. Systems Diagram SYSTEMS | 7 Monitoring Events HDFS JSON Extract MapReduce 1.2.0.1.3.2.0 Load Hive 0.12.0 Flume Log4J Appender Flume 1.5.0 Access End User Bad records sink Export
  • 8. ETL Summary • Extract • JSON files in HDFS • Transform • Generic Java based ETL framework • MapReduce jobs extract features • Quality checks • Load • Load data into partitioned ORC Hive tables DATA FLOW | 8
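The extract, quality-check and staging steps summarized above can be sketched in plain Java. This is a minimal illustration, not the framework from the talk: the class name `EtlRouter`, the required field names and the "all required fields present" rule are assumptions, and a real job would run this logic inside a MapReduce mapper and send bad rows to the Flume sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: field names and the validation rule are assumptions.
class EtlRouter {
    // Fields every record must carry to pass the quality check (assumed).
    static final String[] REQUIRED = {"entityId", "timestamp", "value"};

    final List<String> stagingRows = new ArrayList<>(); // CSV lines for the staging file
    final List<String> badRecords = new ArrayList<>();  // rows that would go to the bad-record log

    // Routes one already-parsed record to either the staging output or the bad-record list.
    void process(Map<String, String> record) {
        StringBuilder csv = new StringBuilder();
        for (String field : REQUIRED) {
            String value = record.get(field);
            if (value == null || value.isEmpty()) {
                badRecords.add("missing " + field + ": " + record);
                return; // bad row: log it and skip the staging output
            }
            if (csv.length() > 0) {
                csv.append(',');
            }
            csv.append(value);
        }
        stagingRows.add(csv.toString());
    }

    public static void main(String[] args) {
        EtlRouter router = new EtlRouter();
        router.process(Map.of("entityId", "e1", "timestamp", "1400000000", "value", "0.5"));
        router.process(Map.of("entityId", "e2", "value", "0.7")); // no timestamp
        System.out.println("staged: " + router.stagingRows);
        System.out.println("rejected: " + router.badRecords.size());
    }
}
```

Keeping rejected rows in a separate channel lets the load step consume the staging file without re-validating it.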
  • 10. Hadoop: Pros • Dataset volume • Data volume grows exponentially • Integrates with existing ecosystem • HiveQL • Experimentation and exploration • No expensive software or hardware to buy TOOLS AND TECHNOLOGIES | 10
  • 11. Hadoop: Cons • Job monitoring and scheduling • Data quality • Error handling and notification • Programming model • Generic framework mitigates some of that TOOLS AND TECHNOLOGIES | 11
  • 12. CAN WE OVERCOME SOME OF THOSE?
  • 13. Keeping the Elephant “Lean” • Job control without the complexity of external tools • Checks and validations • Unified configuration model • Integration with scripts • Automation • Job restartability DATA ENGINEERING | 13
  • 14. HEY! WHAT ABOUT SPRING?
  • 16. What is it about? • Part of the Spring Framework • Run Hadoop apps as standard Java apps using DI • Unified declarative configuration model • APIs to run MapReduce, Hive, and Pig jobs • Script HDFS operations using any JVM-based language • Supports both classic MR and YARN TOOLS AND TECHNOLOGIES | 16
  • 17. The Apache Hadoop Namespace TOOLS AND TECHNOLOGIES | 17 Also supports annotation based configuration via the @EnableHadoop annotation.
  • 18. Job Configuration: Standard Hadoop APIs TOOLS AND TECHNOLOGIES | 18 Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setJarByClass(WordCountMapper.class); job.setMapperClass(WordCountMapper.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true);
  • 19. Configuring Hadoop with Spring SPRING HADOOP | 19 <context:property-placeholder location="hadoop-dev.properties"/> <hdp:configuration> fs.default.name=${hd.fs} </hdp:configuration> <hdp:job id="word-count-job" input-path="${input.path}" output-path="${output.path}" jar="hadoop-examples.jar" mapper="examples.WordCount.WordMapper" reducer="examples.WordCount.IntSumReducer"/> <hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/> input.path=/wc/input/- output.path=/wc/word/- hd.fs=hdfs://localhost:9000
  • 20. SPRING HADOOP | 20 Configuration Attributes
  • 21. Creating a Job SPRING HADOOP | 21
  • 22. Injecting Jobs • Use DI to obtain reference to Spring managed Hadoop job • Perform additional validation and configuration before submitting TOOLS AND TECHNOLOGIES | 22 public class WordService { @Autowired private Job mapReduceJob; public void processWords() { mapReduceJob.submit(); } }
  • 23. Running a Job TOOLS AND TECHNOLOGIES | 23
  • 24. Distributed Cache TOOLS AND TECHNOLOGIES | 24
  • 25. Using Scripts TOOLS AND TECHNOLOGIES | 25
  • 26. Scripting Implicit Variables TOOLS AND TECHNOLOGIES | 26
  • 27. Scripting Support in HDFS • FSShell is designed to support scripting languages • Use these for housekeeping tasks: • Check for files, prepare input data, clean output directories, set flags, etc. TOOLS AND TECHNOLOGIES | 27
  • 29. What is it about? • Born out of collaboration with Accenture in 2007 • Fully automated processing of large volumes of data • Logging, transaction management, listeners, job statistics, restart, skipping, and resource management • Automatic retries after failure • Sync, async and parallel processing • Data partitioning TOOLS AND TECHNOLOGIES | 29
  • 30. Hadoop Workflow Orchestration • Complex data flows • Reuses batch infrastructure to manage Hadoop workflows. • Steps can be any Hadoop job type or HDFS script • Jobs can be invoked by events or scheduled. • Steps can be sequential, conditional, split, concurrent, or programmatically determined. • Works with flat files, XML, or databases. TOOLS AND TECHNOLOGIES | 30
  • 31. Spring Batch Configuration • Jobs are composed of steps TOOLS AND TECHNOLOGIES | 31 <job id="job1"> <step id="import" next="wordcount"> <tasklet ref="import-tasklet"/> </step> <step id="wordcount" next="pig"> <tasklet ref="wordcount-tasklet"/> </step> <step id="pig" next="parallel"> <tasklet ref="pig-tasklet"/> </step> <split id="parallel" next="hdfs"> <flow><step id="mrStep"> <tasklet ref="mr-tasklet"/> </step></flow> <flow><step id="hive"> <tasklet ref="hive-tasklet"/> </step></flow> </split> <step id="hdfs"> <tasklet ref="hdfs-tasklet"/></step> </job>
  • 32. Spring Data Hadoop Integration TOOLS AND TECHNOLOGIES | 32
  • 34. What is it about? • Builds production-ready Spring applications. • Creates a “runnable” jar with dependencies and classpath settings. • Can embed Tomcat or Jetty within the JAR • Automatic configuration • Out of the box features: • statistics, metrics, health checks and externalized configuration • No code generation and no requirement for XML configuration. TOOLS AND TECHNOLOGIES | 34
  • 35. PUTTING IT ALL TOGETHER
  • 36. Spring Data Flow Components TOOLS AND TECHNOLOGIES | 36 Spring Boot Extract Spring Batch 2.0 Load Spring Hadoop 2.01.1.5 HDFS Hive 0.12.0 MapReduce HDP 1.3
  • 37. Hierarchical View TOOLS AND TECHNOLOGIES | 37 Spring Boot Spring Batch Job control Spring Hadoop - Notifications - Validation - Scheduling - Data Flow - Callbacks
  • 38. HADOOP DATA FLOWS, SPRINGIFIED
  • 39. Spring Hadoop Configuration • Job parameters configured by Spring • Sensible defaults used • Parameters can be overridden: • External properties file. • At runtime via system properties: -Dproperty.name=property.value TOOLS AND TECHNOLOGIES | 39 <configuration> fs.default.name=${hd.fs} io.sort.mb=${io.sort.mb:640mb} mapred.reduce.tasks=${mapred.reduce.tasks:1} mapred.job.tracker=${hd.jt:local} mapred.child.java.opts=${mapred.child.java.opts} </configuration>
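The `${name:default}` placeholders above follow a simple lookup rule that can be imitated without Spring. The sketch below is an illustration only, not Spring's implementation: the class `PlaceholderResolver` is hypothetical, and it assumes system properties win over the properties file, which matches the override behaviour the slide describes but depends on how the placeholder configurer is set up in a real application.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for Spring's placeholder resolution; lookup order is an assumption.
class PlaceholderResolver {
    // Matches ${name} and ${name:default}.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    static String resolve(String text, Map<String, String> fileProps) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String value = System.getProperty(name); // -Dname=value, checked first
            if (value == null) {
                value = fileProps.get(name);         // external properties file
            }
            if (value == null) {
                value = m.group(2);                  // inline default, null if absent
            }
            if (value == null) {
                throw new IllegalArgumentException("Unresolved placeholder: " + name);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("hd.fs", "hdfs://localhost:9000");
        System.out.println(resolve("fs.default.name=${hd.fs}", props));
        System.out.println(resolve("mapred.job.tracker=${hd.jt:local}", props));
    }
}
```

A placeholder with neither a value nor an inline default fails fast here, which is the safer behaviour for a job that would otherwise start with a half-resolved configuration.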
  • 40. MapReduce Jobs • Configured via Spring Hadoop • One job per entity TOOLS AND TECHNOLOGIES | 40 <job id="metricsMR" input-path="${mapred.input.path}" output-path="${mapred.output.path}" mapper="GenericETLMapper" reducer="GenericETLReducer" input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat" output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat" key="TextArrayWritable" value="org.apache.hadoop.io.NullWritable" map-key="org.apache.hadoop.io.Text" map-value="org.apache.hadoop.io.Text" jar-by-class="GenericETLMapper"> volga.etl.dto.class=Metric </job>
  • 41. MapReduce Jobs • Jobs are wrapped into Tasklet definitions TOOLS AND TECHNOLOGIES | 41 <job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
  • 42. Hive Configuration • Hive steps also defined as tasklets • Parameters are passed from MapReduce phase to Hive phase TOOLS AND TECHNOLOGIES | 42 <hive-client-factory host="${hive.host}" port="${hive.port:10000}"/> <hive-tasklet id="load-notifications"> <script location="classpath:hive/ddl/notifications-load.hql"/> </hive-tasklet> <hive-tasklet id="load-metrics"> <script location="classpath:hive/ddl/metrics-load.hql"> <arguments>INPUT_PATH=${mapreduce.output.path}</arguments> </script> </hive-tasklet>
  • 43. Spring Batch Configuration • One Spring Batch job per entity. TOOLS AND TECHNOLOGIES | 43 <job id="metrics" restartable="false" parent="VolgaETLJob"> <step id="cleanMetricsOutputDirectory" next="metricsMapReduce"> <tasklet ref="setUpJobTasklet"/> </step> <step id="metricsMapReduce"> <tasklet ref="metricsJobTasklet"> <listeners> <listener ref="mapReduceErrorThresholdListener"/> </listeners> </tasklet> <fail on="FAILED" exit-code="Map Reduce Step Failed"/> <end on="COMPLETED"/> <!--<next on="*" to="loadMetricsIntoHive"/>--> </step> <step id="loadMetricsIntoHive"> <tasklet ref="load-notifications"/> </step> </job>
  • 44. Spring Batch Listeners • Monitor job flow • Take action on job failure • PagerDuty notifications • Save job counters to the audit database • Notify team if counters are not consistent with historical audit data (based on thresholds) TOOLS AND TECHNOLOGIES | 44
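The threshold checks mentioned above come down to two comparisons: the share of bad records in a run, and how far a counter drifts from its historical value. The following is a rough sketch of that arithmetic only. The class and method names and the tolerance values are assumptions, and the real listeners would read counters from the Hadoop job and the audit database rather than take them as arguments.

```java
// Hypothetical sketch of the counter checks; names and thresholds are assumptions.
class CounterThresholdCheck {

    // True when the fraction of bad records does not exceed maxErrorRate.
    static boolean errorRateAcceptable(long badRecords, long totalRecords, double maxErrorRate) {
        if (totalRecords == 0) {
            return badRecords == 0;
        }
        return (double) badRecords / totalRecords <= maxErrorRate;
    }

    // True when the current counter is within +/- tolerance of the historical average.
    static boolean consistentWithHistory(long current, long historicalAverage, double tolerance) {
        if (historicalAverage == 0) {
            return current == 0;
        }
        double deviation = Math.abs(current - historicalAverage) / (double) historicalAverage;
        return deviation <= tolerance;
    }

    public static void main(String[] args) {
        // 1.45B records against a 1.5B average, with an assumed 10% tolerance.
        System.out.println(consistentWithHistory(1_450_000_000L, 1_500_000_000L, 0.10));
        // 5,000 bad rows out of 1,000,000, with an assumed 1% error budget.
        System.out.println(errorRateAcceptable(5_000L, 1_000_000L, 0.01));
    }
}
```

A listener built on checks like these could fail the step on the first condition and only send a notification on the second.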
  • 45. Spring Boot: Pulling Everything Together • Runnable jar created during build process • Controlled by Maven plugin TOOLS AND TECHNOLOGIES | 45 <plugin> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-maven-plugin</artifactId> <configuration> <finalName>maas-etl-${project.version}</finalName> <classifier>spring</classifier> <mainClass>com.rackspace....JobRunner</mainClass> <excludeGroupIds>org.slf4j</excludeGroupIds> </configuration> </plugin>
  • 46. HIVE • Typical Use Cases • File formats • ORC • Abstractions • Hive in the monitoring pipeline • Query performance
  • 47. Overview • Translates SQL commands into MR jobs. • Structured and unstructured data in multiple formats • Standard access protocols, including JDBC and Thrift • Provides several serialization mechanisms • Integrates seamlessly with Hadoop: HCatalog, Pig, HBase, etc. HIVE | 47
  • 48. Hive vs. RDBMS HIVE | 48 (Hive / Traditional databases) • SQL interface / SQL interface • Focus on batch analytics / Mostly online, interactive analytics • No transactions / Transactions are their way of life • No random inserts; updates are not natively supported (but possible) / Random inserts and updates • Distributed processing via MR / Distributed processing capabilities vary • Scales to hundreds of nodes / Seldom scales beyond 20 nodes • Built for commodity hardware / Expensive, proprietary hardware • Low cost per petabyte / What's a petabyte?
  • 49. Abstraction Layers in Hive HIVE | 49 Database Table Partition Skewed Keys Table Partition Partition Unskewed Keys Bucket Bucket Bucket Optional
  • 50. Schemas and File Formats • We used the ORCFile format: built-in, easy to use and efficient. • Efficient light-weight + generic compression • Run length encoding for integers and strings, dictionary encoding, etc. • Generic compression: Snappy, LZO, and ZLib (default) • High performance • Indexes value ranges within blocks of ORCFile data • Predicate filter pushdown allows efficient scanning during queries. • Flexible Data Model • Hive types are supported including maps, structs and unions. HIVE | 50
  • 51. The ORC File Format • An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. • Default stripe size is 256 MB (orc.stripe.size). • Large stripes allow efficient reads from HDFS; the stripe size is configured independently of the HDFS block size. HIVE | 51
  • 52. The ORC File Format: Index • Doesn’t answer queries • Required for skipping rows: • Row index entries provide offsets that enable seeking • Min and max values for each column HIVE | 52
  • 53. ORC File Index Skipping HIVE | 53 Skipping works for number types and for string types. Done by recording a min and max value inside the inline index and determining if the lookup value falls outside that range.
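The min/max skipping described above can be shown with a toy in-memory model. This is a sketch of the idea only and does not use the ORC reader API: `RowGroup`, `ScanResult` and `findEqual` are hypothetical names, and the values are plain longs rather than encoded column streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of min/max index skipping; not the ORC reader API.
class MinMaxIndexSkipping {

    // A block of column values with the min and max recorded when it is written.
    static final class RowGroup {
        final long[] values;
        final long min;
        final long max;

        RowGroup(long... values) {
            this.values = values;
            long lo = Long.MAX_VALUE;
            long hi = Long.MIN_VALUE;
            for (long v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    static final class ScanResult {
        final List<Long> matches = new ArrayList<>();
        int groupsRead = 0; // how many groups actually had to be scanned
    }

    // Evaluates "column = target", skipping any group whose range excludes the target.
    static ScanResult findEqual(List<RowGroup> groups, long target) {
        ScanResult result = new ScanResult();
        for (RowGroup group : groups) {
            if (target < group.min || target > group.max) {
                continue; // the index alone proves there is no match in this group
            }
            result.groupsRead++;
            for (long v : group.values) {
                if (v == target) {
                    result.matches.add(v);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = Arrays.asList(
                new RowGroup(1, 5, 9), new RowGroup(10, 12, 20), new RowGroup(21, 30, 30));
        ScanResult r = findEqual(groups, 12);
        System.out.println(r.matches + " found after reading " + r.groupsRead + " group(s)");
    }
}
```

The ranges in this example do not overlap because the values are sorted across groups. On unsorted data most ranges would contain the target and little would be skipped, which is why the slides pair skipping with sorting.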
  • 54. The ORC File Format: File Footer • List of stripes in the file, the number of rows per stripe, each column's data type. • Column-level aggregates: count, min, max, and sum. • ORC uses the file footer to locate each column's data streams. HIVE | 54
  • 55. Predicate Pushdowns • “Push down” parts of the query to where the data is. • filter/skip as much data as possible, and • greatly reduce input size. • Sorting a table on its secondary keys also reduces execution time. • Sorted columns are grouped together in one area on disk and the other pieces will be skipped very quickly. HIVE | 55
  • 57. Query Performance • Lower latency Hive queries rely on two major factors: • Sorting and skipping data as much as possible • Minimizing data shuffle from mappers to reducers HIVE | 57
  • 58. Improving Query Performance • Divide data among different files/directories • Partitions, buckets, etc. • Skip records using small embedded indexes. • ORCFile format. • Sort data ahead of time. • Simplifies joins making ORCFile skipping more effective. HIVE | 58
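Dividing data among directories and files, as listed above, can be pictured as two small functions: one that derives a partition directory from a column value and one that assigns a row to a bucket by hashing its key. The sketch below rests on assumptions: the table path, column name and bucket count are made up, and Java's `String.hashCode()` stands in for Hive's own hash function, which may produce different bucket numbers.

```java
// Illustrative only: paths, names and the hash function are assumptions.
class HiveLayoutSketch {

    // Partitioned tables keep each partition in a column=value directory.
    static String partitionPath(String tableDir, String column, String value) {
        return tableDir + "/" + column + "=" + value;
    }

    // Assigns a key to one of numBuckets buckets; the mask keeps the result non-negative.
    static int bucketFor(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        String dir = partitionPath("/warehouse/metrics", "dt", "2014-06-01");
        int bucket = bucketFor("entity-42", 8);
        System.out.println(dir + " bucket " + bucket);
    }
}
```

A query that filters on the partition column only has to read one directory, and a join on the bucketing key can match bucket to bucket instead of shuffling the whole table.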
  • 59. The Big Picture DATA ENGINEERING | 59 Data Preprocessing HDFS HDFS MapReduce Start Here JSON Hive File Data Load Dynamic Load Partitioning Bucketing Indexing HDFS Hive File Staging Table Prod Table Data Access API Hive CLI Apache Thrift
  • 60. THANK YOU! Get in touch: alexvsilva@gmail.com @thealexsilva