Hive at Yahoo: Letters from the trenches
Presented by Mithun Radhakrishnan, Chris Drome ⎪ June 10, 2015
2015 Hadoop Summit, San Jose, California
About myself
 Mithun Radhakrishnan
 Hive Engineer at Yahoo!
 Hive Committer and long-time contributor
› Metastore-scaling
› Integration
› HCatalog
 mithun@apache.org
 @mithunrk
About myself
 Chris Drome
 Hive Engineer at Yahoo!
 Hive contributor
 cdrome@yahoo-inc.com
Recap
[Chart: TPC-H 1 TB benchmark. Y-axis: time (in seconds), 0 to 2500. X-axis: queries q1_pricing_summary_report through q22_global_sales_opportunity. Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
1 TB
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5-17x
› Average query time: 172 seconds
• Between 5-947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% queries completed in under 2 minutes
› 81% queries completed in under 4 minutes
Explaining the speed-ups
 Hadoop 2.x, et al.
 Apache Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M&R
• Intermediate data and the HDFS
› Smart scheduling
› Container re-use
› Pipelined job start-up
 Hive
› Statistics
› Vectorized Execution
 ORC
› Predicate pushdown (PPD)
Expectations with Hive 0.13 production
 Tez would outperform M/R by miles
 Tez would enable better cluster utilization
› Use fewer resources
 Tez (and dependencies) would be “production ready”
› GUI for task logs, DAG overviews, swim-lanes
› Speculative execution
 Similarly, ORC and Vectorization
› Support evolving schemas
The Y!Grid
 18 Hadoop Clusters in YGrid
› 41565 Nodes
› Biggest cluster: 5728 Nodes
› 1M jobs a day
 Hadoop 2.6+
 Large Datasets
› Daily, hourly, minute-level frequencies
› Thousands of partitions, 100s of 1000s of files, TBs of data per partition
› 580 PB of data, total
 Pig 0.14 on Tez, Pig 0.11
 Hive 0.13 on Tez
 HCatalog for interoperability
 Oozie for scheduling
 GDM for data-loading
 Spark, HBase, Storm, etc…
Data processing use cases
 Grid usage
› 30+ million jobs per month
› 12+ million Oozie launcher jobs
 Pig usage
› Handles majority of data pipelines/ETL (~43% of jobs)
 Hive usage
› Relatively smaller niche
› 632,000 queries per month (35% Tez)
 HCatalog for Inter-operability
› Metadata storage for all Hadoop data
› Yahoo-scale
› Pig pipelines with Hive analytics
Business Intelligence Tools
 Tableau, MicroStrategy
 Power users
› Tableau Server for scheduled reports
 Challenges:
› Security
• ACLs, Authentication, Encryption over the wire
› Bandwidth
• Transporting results over ODBC
• Limit result-set to 1000s-10000s of rows
• Aggregations
› Query Latency
• Metadata queries
• Partition/Table scans
• Materialized views
Non-negotiables for Hive upgrade at Yahoo!
 Data producer owns the data
› Unlike traditional DBs
 Multi-paradigm data access/generation
› Pig/Hive/MapReduce using HCatalog
 Highly available metadata service
 UI for tracking/debugging jobs
 Execution engine should ideally support speculative execution
Yahoo! Hive-0.13
 Based on Apache Hive-0.13.1
 Internal Yahoo! Patches (admin web-services, data discovery, etc.)
 Community patches to stabilize Apache Hive-0.13.1
› Tez
• HIVE-7544, HIVE-6748, HIVE-7112, …
› Vectorization
• HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …
› Failures
• HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …
› Optimizations
• HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …
› Data integrity
• HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …
 Phased upgrades
› Phase 1: 285 JIRAs
› Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)
› Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)
Hive deployment (per cluster)
 One remote Hive Metastore “instance”
› 4 HCatalog Servers behind a hardware VIP
• L3DSR load balancer
• 96GB-128GB RAM, 16 core boxes
› Backed by Oracle RAC
 About 10 Gateways
› Interactive use of Hive (and Pig, Oozie, M/R)
› hive.metastore.uris -> HCatalog
 About 4 HiveServer2 instances
› Ad Hoc queries, aggregation
Evolution of grid services at Yahoo!
[Diagram components: Browser, HUE, BI Tools, Hive Server 2, Gateway Machines, HCatalog, Oracle RAC, Grid]
Challenges experienced with Hive on Tez
 Query performance on very large data sets
› HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp
 Split-generation on very large data sets
› Tends to generate more splits (map tasks) compared to M/R
› Long split generation times
› Hogging the Hadoop queues
• Wave factor vs multi-tenancy requirements
› HIVE-10114: Split strategies for ORC
 Scaling problems with ATS
› More of a problem with Pig workflows
› 10K+ tasks/job are routine
› AM progress reporting, heart-beating, memory usage
› Hadoop 2.6.0.10+
Fast execution engines aren’t the whole picture
 At Yahoo! Scale,
› 100s of Databases per cluster
› 100s of Tables per database
› 100s of columns per Table
› 1000s of Partitions per Table
• Larger tables: Thousands of partitions, per hour
• Millions of partitions every few days
• 10s of millions of partitions, over dataset retention period
 Problems:
› Metadata volume
• Database/Table/Partition IO Formats
• Record serialization details
• HDFS paths
• Statistics
– Per partition
– Per column
Letters from the trenches
From: Another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow queries
YHive team,
My query fails with OutOfMemoryError. I tried increasing
container size, but it still fails. Please help!
Here are my settings:
set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts=“-Xmx1024m”
...
INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo )
SELECT * FROM {
...
}
...
From: YET another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow UDF performance
YHive team,
Why does using a simple custom UDF cause queries to
time out?
SELECT foo, bar, my_function( goo )
FROM my_large_table
WHERE ...
From: The ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following
6 partition keys: {hourly-timestamp, name, property,
geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the
remaining partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to
be faster, how come queries on our table take forever
just to get off the ground?
Yours gigantically,
Project Grape Ape
Metadata volume and Query Execution time
 Anatomy of a Hive query
1. Compile query to AST
2. Thrift-call to Metastore, for partition list
3. Examine partitions, data-paths, etc. Construct physical query plan.
4. Run optimizers on the plan
5. Execute plan. (M/R, Tez).
 Partition pruner:
› Removes partitions that shouldn’t participate in the query.
› In effect, remove input-directories from the Hadoop job.
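The pruning step above can be pictured with a small sketch. This is not Hive's PartitionPruner code: the `Partition` class, the `prune` helper, and the table paths are hypothetical, made up only to show that partitions failing the predicate never become input directories of the job.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Partition:
    """Hypothetical, simplified stand-in for a metastore partition."""
    keys: Dict[str, str]  # partition-key name -> value
    location: str         # HDFS directory holding the partition's data


def prune(partitions: List[Partition],
          predicate: Callable[[Dict[str, str]], bool]) -> List[Partition]:
    """Keep only the partitions whose key values satisfy the predicate."""
    return [p for p in partitions if predicate(p.keys)]


partitions = [
    Partition({"dt": "20150608"}, "/data/my_table/dt=20150608"),
    Partition({"dt": "20150609"}, "/data/my_table/dt=20150609"),
    Partition({"dt": "20150610"}, "/data/my_table/dt=20150610"),
]

# Equivalent of: WHERE dt >= '20150609'
survivors = prune(partitions, lambda keys: keys["dt"] >= "20150609")

# Only the surviving directories are handed to the Hadoop job as input.
input_dirs = [p.location for p in survivors]
print(input_dirs)
```

Note that even this toy version has to touch every partition object once, which is why the size of the partition list matters so much in what follows.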
The problems of large-scale metadata
 Partition pruner is single-threaded
› Query spans a day
› Query spanning a week? 2 million partitions
 Partition objects are huge:
› HDFS Paths
› IO Formats
› Record Deserializer info
› Data column schema
 Datanucleus:
› 1 Partition: Join 6 Oracle tables in the backend.
 Thrift serialization/deserialization takes minutes.
› *Minutes*.
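A back-of-the-envelope check of the partition counts, assuming the roughly 10000 partitions per hour quoted in the Project Grape Ape letter. The constant and function names are ours, and the figure is an approximation rather than a measurement.

```python
PARTITIONS_PER_HOUR = 10_000  # approximate figure quoted in the letter


def partitions_in(days: int, per_hour: int = PARTITIONS_PER_HOUR) -> int:
    """Number of partitions a query spanning `days` full days must consider."""
    return per_hour * 24 * days


per_day = partitions_in(1)
per_week = partitions_in(7)
print(per_day, per_week)
```

Under that assumption a one-day query touches 240,000 partitions and a one-week query about 1.7 million, which is the order of magnitude behind the "2 million partitions" figure.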
Immediate workarounds
 “Hive wasn’t originally designed for more than 10000s of partitions, total…”
 Throw hardware at it
› 4 HCatalog servers behind a hardware VIP
› High-RAM boxes:
• 96GB-128 GB metastore processes
• Tune each to use 100 connections to the Oracle RAC
 Client-side tuning
› Increase hive.metastore.client.socket.timeout
› Increase heap size as needed (container size)
› Multi-threaded fstat operations
Fix the leaky/noisy bits
 Metastore frequently ran out of memory:
› Disable Hadoop FileSystem cache
• HIVE-3098, HDFS-3545
• FileSystem.CACHE used UGI.hashCode()
– Compared Subjects by identity, not equivalence.
› Fixed Thrift 0.9
• TSaslServerTransport had circular references
• JVM couldn’t detect these for cleanup
– WeakReferences are your friend
• Fix incompatibility with L3DSR pings
 Data discovery from Oozie:
› Use JMS notifications, on publication
› Oozie Coordinators wake up on ActiveMQ notification, kick off dependent workflows
› Reduced polling frequency
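The FileSystem cache leak mentioned above can be illustrated with a toy model. This is not Hadoop's code: `User`, `IdentityKey`, and the dictionary cache are hypothetical stand-ins, and the sketch assumes the cache key is compared by object identity, so two logically equivalent users never share an entry.

```python
class User:
    """Hypothetical stand-in for a per-request user/subject object."""

    def __init__(self, name):
        self.name = name


class IdentityKey:
    """Cache key that hashes and compares by object identity."""

    def __init__(self, user):
        self.user = user

    def __hash__(self):
        return id(self.user)

    def __eq__(self, other):
        return isinstance(other, IdentityKey) and self.user is other.user


def serve_requests(n, make_key):
    """Simulate n requests from the same logical user; return the cache size."""
    cache = {}
    for _ in range(n):
        user = User("alice")  # a fresh object for every request
        # object() mimics an expensive cached handle such as a FileSystem.
        cache.setdefault(make_key(user), object())
    return len(cache)


leaked = serve_requests(1000, IdentityKey)        # one entry per request
shared = serve_requests(1000, lambda u: u.name)   # one entry per logical user
print(leaked, shared)
```

With identity-based keys the cache grows with the number of requests rather than the number of users, which is one way a long-running server can run out of memory.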
More fixes
 Metadata-only queries:
› SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
› Replace HiveMetaStoreClient::getPartitions() with getPartitionNames().
› Local job, versus cluster.
 Optimize the optimizer:
› The first step in some optimizers:
• List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
• Pray that the client and/or the metastore don’t run out of memory.
• Take a nap.
› Fixed PartitionPruner, MetadataOnlyOptimizer.
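The metadata-only idea above can be sketched as follows: a query that selects only a partition key can be answered from partition names, without fetching full partition objects or launching a cluster job. The helper name and the sample names are hypothetical, and the sketch assumes names of the form `key=value/key=value` with no escaping.

```python
def distinct_key_values(partition_names, key, limit):
    """Answer SELECT DISTINCT <key> ... ORDER BY <key> DESC LIMIT <limit>
    using partition names alone."""
    values = set()
    for name in partition_names:
        # A name looks like "tstamp=2015061000/property=news".
        for part in name.split("/"):
            k, _, v = part.partition("=")
            if k == key:
                values.add(v)
    return sorted(values, reverse=True)[:limit]


names = [
    "tstamp=2015060900/property=news",
    "tstamp=2015061000/property=news",
    "tstamp=2015061000/property=mail",
]
result = distinct_key_values(names, "tstamp", 1000)
print(result)
```

Names are short strings, so listing them is far cheaper than deserializing partition objects that each carry paths, IO formats, and schemas.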
Long-term fixes:
 DirectSQL short-circuits:
› Datanucleus problems at scale
• (Yes, we are aware of the irony that might result from extrapolation.)
› Specific to the backing DB.
 Compaction of Partition info:
› HIVE-7223, HIVE-7576, HIVE-9845, etc.
› Schema evolves infrequently
› Partition-info rarely differs from table-info
– Except HDFS paths (which are super-strings)
› List<Partition> vs Iterator<Partition>
• PartitionSet abstraction
– The delight of Inheritance in Thrift
• Reduced memory foot-prints
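A minimal sketch of the compaction idea, assuming partitions usually share the table's storage details and differ only in the tail of their HDFS path. The class and field names are hypothetical and do not reflect the actual design in the JIRAs listed above; the point is only that shared information is stored once and partitions are produced lazily through an iterator.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class StorageInfo:
    """Details that rarely differ between a table and its partitions."""
    input_format: str
    serde: str
    columns: Tuple[str, ...]


@dataclass
class CompactPartitionSet:
    root: str                  # shared path prefix (the table location)
    shared: StorageInfo        # stored once for the whole set
    relative_paths: List[str]  # the only per-partition data kept

    def __iter__(self) -> Iterator[dict]:
        # Partitions are materialized one at a time instead of as a big list.
        for rel in self.relative_paths:
            yield {"location": self.root + "/" + rel, "storage": self.shared}


shared = StorageInfo("OrcInputFormat", "OrcSerde", ("foo", "bar"))
pset = CompactPartitionSet("/projects/my_table", shared,
                           ["dt=20150609", "dt=20150610"])
locations = [p["location"] for p in pset]
print(locations)
```

Because each partition stores only a path suffix, the serialized size grows with the length of the suffixes rather than with full paths plus repeated schema and format details.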
“The finest trick of The Devil was to
persuade you that he does not exist.”
-- ???
From: A major reporting team
To: The Yahoo Hive Team
Subject: Urgent! Customer reports are borking.
Dear YHive team,
When we connect Tableau Server 8.3 to Y!Hive
0.12/0.13, it is unusably slow. Queries take too long
to run, and time out.
We’d prefer not to change our query-code too
much. How soon can Hive accommodate our simple queries?
Yours hysterically,
Project Zodiac
Analysis: The query
 Non-const partition key predicates:
› E.g.
WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd')
AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')
› Solution: Use constant expressions where possible.
› Fix: Hive 1.x supports dynamic partition pruning, and constant folding.
 Costly joins with partitioned dimension tables:
› E.g.
› SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table WHERE dt IN (SELECT MAX(dt) FROM dimension_table)) …;
› Workaround: External “pointer” tables.
› Fix: Dynamic partition pruning.
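One way to follow the "use constant expressions" advice is to compute the date bounds in the client and substitute literals into the query text. The sketch below assumes the partition key holds `yyyyMMdd` strings in UTC; the function name and the fixed reference date are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def constant_date_predicate(column, now, newest_days_ago, oldest_days_ago):
    """Build a WHERE fragment with literal yyyyMMdd bounds, so the partition
    pruner sees constants instead of non-constant function calls."""
    fmt = "%Y%m%d"
    upper = (now - timedelta(days=newest_days_ago)).strftime(fmt)
    lower = (now - timedelta(days=oldest_days_ago)).strftime(fmt)
    return f"{column} <= '{upper}' AND {column} >= '{lower}'"


# A fixed "now" keeps the example reproducible; a real client would use
# datetime.now(timezone.utc).
now = datetime(2015, 6, 10, tzinfo=timezone.utc)
clause = constant_date_predicate("utc_time", now, 2, 32)
print(clause)
```

The generated fragment replaces the two `from_unixtime(unix_timestamp() - …)` comparisons, leaving the rest of the query unchanged.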
Analysis: The data
 Data stored in TEXTFILE
› Solution: Switch to columnar storage
• ORC, dictionary encoding, vectorization, predicate pushdown
 Over-partitioning:
› Too many partition keys
› Diminishing returns with partition pruning
› Solution: Eliminate partition keys, consider sorting
 Small Part files
› Hard-coded nReducers
› E.g.
hive> dfs -count /projects/foo_stats;
9081 682735 1876847648672 /projects/foo.db/foo_stats
› Solution:
• set hive.merge.mapfiles=true;
• set hive.merge.mapredfiles=true;
• set hive.merge.tezfiles=true;
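To see why the `dfs -count` output above signals a small-files problem, the average file size can be worked out from its columns (directory count, file count, content size in bytes, path). The helper name is ours; the block size is the `dfs.blocksize` value listed in the backup slides.

```python
def parse_dfs_count(line):
    """Split one line of `dfs -count` output into its four columns."""
    dirs, files, size, path = line.split()
    return int(dirs), int(files), int(size), path


dirs, files, size_bytes, path = parse_dfs_count(
    "9081 682735 1876847648672 /projects/foo.db/foo_stats")

BLOCK_SIZE = 134217728  # dfs.blocksize from the backup slides (128 MB)

avg_file_bytes = size_bytes / files
files_per_dir = files / dirs
print(f"{avg_file_bytes / 2**20:.2f} MiB per file on average, "
      f"{avg_file_bytes / BLOCK_SIZE:.1%} of one block, "
      f"{files_per_dir:.0f} files per directory")
```

The average file is only a few megabytes, a small fraction of a block, which is the situation the merge settings are meant to address.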
We’re not done yet
 Tez/ATS scaling
 Speed up split calculation
 Auto/Offline compaction
 Abuse detection
 Better handling of schema evolution
 Skew Joins in Hive
 UDFs with JNI and configuring LD_LIBRARY_PATH
Questions?
Backup
YHive configuration settings:
set hive.merge.mapfiles=false; -- Except when producing data.
set hive.merge.mapredfiles=false; -- Except when producing data.
set tez.merge.files=false; -- Except when producing data.
-- For ORC files.
-- dfs.blocksize=134217728; -- hdfs-site.xml
set orc.stripe.size=67108864; -- 64MB stripes.
set orc.compress.size=262144; -- 256KB compress buffer.
set orc.compress=ZLIB; -- Override to NONE, per table.
set orc.create.index=true; -- ORC indexes.
set orc.optimize.index.filter=true; -- Predicate pushdown with ORC index
set orc.row.index.stride=10000;
YHive configuration settings: (contd)
-- Delegation Token Store settings:
set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;
set hive.cluster.delegation.token.renew-interval=172800000;
(Start HCat Server with -Djute.maxbuffer=24MB -> 190K+ tokens.)
-- Data Nucleus settings:
set datanucleus.connectionPoolingType=DBCP; -- !(BoneCP).
set datanucleus.cache.level1.type=none;
set datanucleus.cache.level2.type=none;
set datanucleus.connectionPool.maxWait=200000;
set datanucleus.connectionPool.minIdle=0;
-- Misc.
set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;
Zookeeper Token Storage performance
Jute buffer size (MB) | Max delegation token count
4 | 30K
8 | 60K
12 | 90K
16 | 130K
20 | 160K
24 | 190K
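The table suggests a roughly linear relationship between buffer size and token capacity. A rough sizing helper under that assumption is sketched below; the function names are ours, the estimate uses the most conservative ratio in the table, and it is an extrapolation rather than a measured limit. ZooKeeper's `jute.maxbuffer` is specified in bytes, hence the conversion.

```python
import math

# Measurements from the table above: buffer size in MB -> max token count.
MEASURED = {4: 30_000, 8: 60_000, 12: 90_000,
            16: 130_000, 20: 160_000, 24: 190_000}

# Most conservative tokens-per-MB ratio observed in the table.
TOKENS_PER_MB = min(tokens / mb for mb, tokens in MEASURED.items())


def jute_buffer_mb(expected_tokens: int) -> int:
    """Estimate the buffer size (in MB) needed for the expected token count."""
    return math.ceil(expected_tokens / TOKENS_PER_MB)


def jute_buffer_bytes(expected_tokens: int) -> int:
    """Same estimate, expressed in bytes."""
    return jute_buffer_mb(expected_tokens) * 1024 * 1024


print(TOKENS_PER_MB, jute_buffer_mb(60_000), jute_buffer_bytes(60_000))
```

Because the lowest ratio is used, the estimate errs on the high side for the larger measured sizes.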
Why Hive on Tez?
 Shark, Impala
› Pre-emption for in-memory systems
› Multi-tenant, shared clusters
› Heterogeneous nodes
› Existing ecosystem
› Community-driven development
 Shark
› Good proof of concept, but was not production ready
› Shuffle performance
› Hive on Spark – under active development
Analysis: Tableau/ODBC driver
 Tableau has come a long way, but
› Schema discovery
• SELECT * FROM my_large_table LIMIT 0;
• SELECT DISTINCT part_key FROM my_large_table;
› SQL dialect
• Depends on vendor-specific driver-name
› Schema metadata-scans
• 3 partition listings per query
› Miscellaneous problems:
• “Custom SQL” rewrites
• Trouble with quoting
 tl;dr : Try to transition to Simba’s 2.0.x Drivers with Tableau 8.3.x

HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 

Recently uploaded (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Hive at Yahoo: Letters from the trenches

  • 1. Hive at Yahoo: Letters from the trenches
      Presented by Mithun Radhakrishnan, Chris Drome ⎪ June 10, 2015
      2015 Hadoop Summit, San Jose, California
  • 2. About myself
      ▪ Mithun Radhakrishnan
      ▪ Hive Engineer at Yahoo!
      ▪ Hive Committer and long-time contributor
          › Metastore-scaling
          › Integration
          › HCatalog
      ▪ mithun@apache.org
      ▪ @mithunrk
  • 3. About myself
      ▪ Chris Drome
      ▪ Hive Engineer at Yahoo!
      ▪ Hive contributor
      ▪ cdrome@yahoo-inc.com
  • 6. [Chart: TPC-H 1 TB. Time (in seconds, axis 0 to 2500) for queries q1 through q22. Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
  • 7. 1 TB
      › 6.2x speedup over Hive 0.10 (RCFile)
          • Between 2.5x and 17x
      › Average query time: 172 seconds
          • Between 5 and 947 seconds
          • Down from 729 seconds (Hive 0.10 RCFile)
      › 61% of queries completed in under 2 minutes
      › 81% of queries completed in under 4 minutes
  • 8. Explaining the speed-ups
      ▪ Hadoop 2.x, et al.
      ▪ Apache Tez
          › (Arbitrary DAG)-based Execution Engine
          › “Playing the gaps” between M&R
              • Intermediate data and the HDFS
          › Smart scheduling
          › Container re-use
          › Pipelined job start-up
      ▪ Hive
          › Statistics
          › Vectorized Execution
      ▪ ORC
          › PPD (predicate pushdown)
  • 9. Expectations with Hive 0.13 in production
      ▪ Tez would outperform M/R by miles
      ▪ Tez would enable better cluster utilization
          › Use fewer resources
      ▪ Tez (and dependencies) would be “production ready”
          › GUI for task logs, DAG overviews, swim-lanes
          › Speculative execution
      ▪ Similarly, ORC and Vectorization
          › Support evolving schemas
  • 10. The Y!Grid
      ▪ 18 Hadoop Clusters in YGrid
          › 41565 Nodes
          › Biggest cluster: 5728 Nodes
          › 1M jobs a day
      ▪ Hadoop 2.6+
      ▪ Large Datasets
          › Daily, hourly, minute-level frequencies
          › Thousands of partitions, 100s of 1000s of files, TBs of data per partition
          › 580 PB of data, total
      ▪ Pig 0.14 on Tez, Pig 0.11
      ▪ Hive 0.13 on Tez
      ▪ HCatalog for interoperability
      ▪ Oozie for scheduling
      ▪ GDM for data-loading
      ▪ Spark, HBase, Storm, etc…
  • 11. Data processing use cases
      ▪ Grid usage
          › 30+ million jobs per month
          › 12+ million Oozie launcher jobs
      ▪ Pig usage
          › Handles majority of data pipelines/ETL (~43% of jobs)
      ▪ Hive usage
          › Relatively smaller niche
          › 632,000 queries per month (35% Tez)
      ▪ HCatalog for interoperability
          › Metadata storage for all Hadoop data
          › Yahoo-scale
          › Pig pipelines with Hive analytics
  • 12. Business Intelligence Tools
      ▪ Tableau, MicroStrategy
      ▪ Power users
          › Tableau Server for scheduled reports
      ▪ Challenges:
          › Security
              • ACLs, Authentication, Encryption over the wire
          › Bandwidth
              • Transporting results over ODBC
              • Limit result-set to 1000s-10000s of rows
              • Aggregations
          › Query Latency
              • Metadata queries
              • Partition/Table scans
              • Materialized views
  • 13. Non-negotiables for Hive upgrade at Yahoo!
      ▪ Data producer owns the data
          › Unlike traditional DBs
      ▪ Multi-paradigm data access/generation
          › Pig/Hive/MapReduce using HCatalog
      ▪ Highly available metadata service
      ▪ UI for tracking/debugging jobs
      ▪ Execution engine should ideally support speculative execution
  • 14. Yahoo! Hive-0.13
      ▪ Based on Apache Hive-0.13.1
      ▪ Internal Yahoo! Patches (admin web-services, data discovery, etc.)
      ▪ Community patches to stabilize Apache Hive-0.13.1
          › Tez
              • HIVE-7544, HIVE-6748, HIVE-7112, …
          › Vectorization
              • HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …
          › Failures
              • HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …
          › Optimizations
              • HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …
          › Data integrity
              • HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …
      ▪ Phased upgrades
          › Phase 1: 285 JIRAs
          › Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)
          › Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)
  • 15. Hive deployment (per cluster)
      ▪ One remote Hive Metastore “instance”
          › 4 HCatalog Servers behind a hardware VIP
              • L3DSR load balancer
              • 96GB-128GB RAM, 16 core boxes
          › Backed by Oracle RAC
      ▪ About 10 Gateways
          › Interactive use of Hive (and Pig, Oozie, M/R)
          › hive.metastore.uris -> HCatalog
      ▪ About 4 HiveServer2 instances
          › Ad Hoc queries, aggregation
  • 16. Evolution of grid services at Yahoo! [Diagram labels: Gateway Machines, Grid, Oracle RAC, Browser, HUE, Hive Server 2, BI Tools, HCatalog]
  • 17. Challenges experienced with Hive on Tez
      ▪ Query performance on very large data sets
          › HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp
      ▪ Split-generation on very large data sets
          › Tends to generate more splits (map tasks) compared to M/R
          › Long split generation times
          › Hogging the Hadoop queues
              • Wave factor vs multi-tenancy requirements
          › HIVE-10114: Split strategies for ORC
      ▪ Scaling problems with ATS
          › More of a problem with Pig workflows
          › 10K+ tasks/job are routine
          › AM progress reporting, heart-beating, memory usage
          › Hadoop 2.6.0.10+
  • 19. Fast execution engines aren’t the whole picture
      ▪ At Yahoo! Scale,
          › 100s of Databases per cluster
          › 100s of Tables per database
          › 100s of columns per Table
          › 1000s of Partitions per Table
              • Larger tables: Thousands of partitions, per hour
              • Millions of partitions every few days
              • 10s of millions of partitions, over dataset retention period
      ▪ Problems:
          › Metadata volume
              • Database/Table/Partition IO Formats
              • Record serialization details
              • HDFS paths
              • Statistics
                  – Per partition
                  – Per column
  • 20. Letters from the trenches
  • 21. From: Another ETL pipeline.
      To: The Yahoo Hive Team
      Subject: Slow queries
      YHive team,
      My query fails with OutOfMemoryError. I tried increasing container size, but it still fails. Please help! Here are my settings:
          set mapreduce.input.fileinputformat.split.maxsize=16777216;
          set mapreduce.map.memory.mb=4096;
          set mapreduce.reduce.memory.mb=4096;
          set mapred.child.java.opts="-Xmx1024m"
          ...
          INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo ) SELECT * FROM { ... }
          ...
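One detail stands out in the settings quoted in this letter: the map and reduce containers are sized at 4096 MB, while `mapred.child.java.opts` caps the task JVM heap at 1024 MB, so enlarging the container alone leaves the heap unchanged. The slides do not state a diagnosis, so the following is only a minimal sketch of how such a mismatch could be spotted. The helper names (`parse_settings`, `heap_mb`, `heap_to_container_ratio`) are hypothetical and not part of Hive or Hadoop.

```python
import re

# Settings copied from the letter (curly quotes normalized to straight quotes).
SETTINGS = '''
set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts="-Xmx1024m"
'''

def parse_settings(script):
    """Collect `set key=value;` lines into a dict, stripping quotes."""
    settings = {}
    for line in script.splitlines():
        m = re.match(r'\s*set\s+([\w.]+)\s*=\s*(.+?);?\s*$', line)
        if m:
            settings[m.group(1)] = m.group(2).strip('"\'')
    return settings

def heap_mb(java_opts):
    """Return the -Xmx value in MB from a JVM options string, or None."""
    m = re.search(r'-Xmx(\d+)([mMgG])', java_opts)
    if not m:
        return None
    size = int(m.group(1))
    return size * 1024 if m.group(2) in 'gG' else size

def heap_to_container_ratio(settings):
    """Fraction of the map container that the JVM heap is allowed to use."""
    container = int(settings['mapreduce.map.memory.mb'])
    return heap_mb(settings['mapred.child.java.opts']) / container

ratio = heap_to_container_ratio(parse_settings(SETTINGS))
print(f"heap uses {ratio:.0%} of the map container")
```

A low ratio suggests that the heap option, rather than the container size, is the setting to revisit first.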
  • 22. From: YET another ETL pipeline.
      To: The Yahoo Hive Team
      Subject: Slow UDF performance
      YHive team,
      Why does using a simple custom UDF cause queries to time out?
          SELECT foo, bar, my_function( goo ) FROM my_large_table WHERE ...
  • 24. From: The ETL team
      To: The Yahoo Hive Team
      Subject: A small matter of size...
      Dear YHive team,
      We have partitioned our table using the following 6 partition keys: {hourly-timestamp, name, property, geo-location, shoe-size, and so on…}. For a given timestamp, the combined cardinality of the remaining partition-keys is about 10000/hr. If queries on partitioned tables are supposed to be faster, how come queries on our table take forever just to get off the ground?
      Yours gigantically,
      Project Grape Ape
  • 26. Metadata volume and Query Execution time
      ▪ Anatomy of a Hive query
          1. Compile query to AST
          2. Thrift-call to Metastore, for partition list
          3. Examine partitions, data-paths, etc. Construct physical query plan.
          4. Run optimizers on the plan
          5. Execute plan. (M/R, Tez).
      ▪ Partition pruner:
          › Removes partitions that shouldn’t participate in the query.
          › In effect, removes input-directories from the Hadoop job.
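The partition-pruning idea on this slide can be illustrated with a small sketch. It is not Hive's PartitionPruner; the partition records, the `prune_partitions` helper, and the table paths are all hypothetical, chosen only to show how filtering on partition-key values shrinks the list of input directories handed to the job.

```python
def prune_partitions(partitions, predicate):
    """Keep only partitions whose key values satisfy the predicate."""
    return [p for p in partitions if predicate(p["keys"])]

# Hypothetical partition metadata: key values plus the data directory.
partitions = [
    {"keys": {"dt": "20150608"}, "path": "/data/my_table/dt=20150608"},
    {"keys": {"dt": "20150609"}, "path": "/data/my_table/dt=20150609"},
    {"keys": {"dt": "20150610"}, "path": "/data/my_table/dt=20150610"},
]

# Equivalent of: WHERE dt >= '20150609'
selected = prune_partitions(partitions, lambda keys: keys["dt"] >= "20150609")

# Only the surviving directories become inputs to the job.
input_dirs = [p["path"] for p in selected]
print(input_dirs)
```

Note that the sketch still has to hold every partition record in memory before filtering, which is the cost the following slides discuss.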
  • 27. The problems of large-scale metadata
      ▪ Partition pruner is single-threaded
          › Query spans a day
          › Query spanning a week? 2 million partitions
      ▪ Partition objects are huge:
          › HDFS Paths
          › IO Formats
          › Record Deserializer info
          › Data column schema
      ▪ Datanucleus:
          › 1 Partition: Join 6 Oracle tables in the backend.
      ▪ Thrift serialization/deserialization takes minutes.
          › *Minutes*.
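A quick back-of-the-envelope check of the partition counts, assuming the "2 million partitions" figure refers to a table like the one in the Project Grape Ape letter (about 10000 partitions per hour). That link is an assumption; the slides do not say which table the figure comes from.

```python
# Figure quoted in the Project Grape Ape letter; assumed to apply here.
PARTITIONS_PER_HOUR = 10_000

def partitions_spanned(hours):
    """Number of partitions a query touches when it spans `hours` hours."""
    return PARTITIONS_PER_HOUR * hours

per_day = partitions_spanned(24)
per_week = partitions_spanned(24 * 7)
print(per_day, per_week)
```

Under that assumption a one-day query touches 240,000 partitions and a one-week query about 1.7 million, which is in line with the slide's round figure.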
  • 28. Immediate workarounds
      ▪ “Hive wasn’t originally designed for more than 10000s of partitions, total…”
      ▪ Throw hardware at it
          › 4 HCatalog servers behind a hardware VIP
          › High-RAM boxes:
              • 96GB-128GB metastore processes
              • Tune each to use 100 connections to the Oracle RAC
      ▪ Client-side tuning
          › Increase hive.metastore.client.socket.timeout
          › Increase heap size as needed (container size)
          › Multi-threaded fstat operations
  • 29. Fix the leaky/noisy bits
      ▪ Metastore frequently ran out of memory:
          › Disable Hadoop FileSystem cache
              • HIVE-3098, HDFS-3545
              • FileSystem.CACHE used UGI.hashcode()
                  – Compared Subjects for equality, not equivalence.
          › Fixed Thrift 0.9
              • TSaslServerTransport had circular references
              • JVM couldn’t detect these for cleanup
                  – WeakReferences are your friend
              • Fix incompatibility with L3DSR pings
      ▪ Data discovery from Oozie:
          › Use JMS notifications, on publication
          › Oozie Coordinators wake up on ActiveMQ notification, kick off dependent workflows
          › Reduced polling frequency
  • 30. More fixes
      ▪ Metadata-only queries:
          › SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
          › Replace HiveMetaStoreClient::getPartitions() with getPartitionNames().
          › Local job, versus cluster.
      ▪ Optimize the optimizer:
          › The first step in some optimizers:
              • List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
              • Pray that the client and/or the metastore don’t run out of memory.
              • Take a nap.
          › Fixed PartitionPruner, MetadataOnlyOptimizer.
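The metadata-only case can be sketched as follows: when a query only needs the distinct values of a partition key, the partition names alone carry the answer, so the full partition objects never need to be fetched. The sketch assumes names in Hive's `key=value/key=value` form and ignores the escaping of special characters; the sample names and the `distinct_key_values` helper are hypothetical.

```python
def distinct_key_values(partition_names, key):
    """Collect the distinct values of one partition key from partition names."""
    values = set()
    for name in partition_names:
        for part in name.split("/"):
            k, _, v = part.partition("=")
            if k == key:
                values.add(v)
    return values

# Hypothetical partition names, as a names-only metastore call might return them.
names = [
    "tstamp=2015061000/property=news",
    "tstamp=2015061000/property=mail",
    "tstamp=2015061001/property=news",
]

# Equivalent of: SELECT DISTINCT tstamp ... ORDER BY tstamp DESC LIMIT 1000
latest = sorted(distinct_key_values(names, "tstamp"), reverse=True)[:1000]
print(latest)
```

Because only short strings are transferred and parsed, this kind of query can run as a local job instead of a cluster job, as the slide notes.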
  • 31. Long-term fixes:
      ▪ DirectSQL short-circuits:
          › Datanucleus problems at scale
              • (Yes, we are aware of the irony that might result from extrapolation.)
          › Specific to the backing DB.
      ▪ Compaction of Partition info:
          › HIVE-7223, HIVE-7576, HIVE-9845, etc.
          › Schema evolves infrequently
          › Partition-info rarely differs from table-info
              – Except HDFS paths (which are super-strings)
          › List<Partition> vs Iterator<Partition>
              • PartitionSet abstraction
                  – The delight of Inheritance in Thrift
              • Reduced memory foot-prints
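The compaction idea (store what partitions share once, keep only what differs, and iterate instead of materializing a list) might look roughly like this. It is an illustrative sketch only, not the Thrift structures from the JIRAs listed on the slide; the class and field names, the format strings, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SharedStorageInfo:
    """Details that rarely differ between a table and its partitions."""
    input_format: str
    output_format: str
    serde: str
    columns: tuple

@dataclass
class PartitionSet:
    """Stores shared info once and only the path suffix per partition."""
    table_location: str
    shared: SharedStorageInfo
    relative_paths: list = field(default_factory=list)

    def _prefix(self):
        return self.table_location.rstrip("/") + "/"

    def add(self, full_path):
        prefix = self._prefix()
        if not full_path.startswith(prefix):
            raise ValueError("partition is outside the table location")
        self.relative_paths.append(full_path[len(prefix):])

    def __iter__(self):
        # Yield one partition at a time instead of building a full list.
        prefix = self._prefix()
        for rel in self.relative_paths:
            yield prefix + rel, self.shared

shared = SharedStorageInfo("ExampleInputFormat", "ExampleOutputFormat",
                           "ExampleSerDe", (("foo", "string"), ("bar", "int")))
pset = PartitionSet("/data/my_table", shared)
pset.add("/data/my_table/dt=20150609")
pset.add("/data/my_table/dt=20150610")
for path, info in pset:
    print(path, info.serde)
```

Each partition then costs one short string rather than a full copy of formats, SerDe details, and column schema.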
  • 32. “The finest trick of The Devil was to persuade you that he does not exist.” -- ???
  • 36. From: A major reporting team
      To: The Yahoo Hive Team
      Subject: Urgent! Customer reports are borking.
      Dear YHive team,
      When we connect Tableau Server 8.3 to Y!Hive 0.12/0.13, it is unusably slow. Queries take too long to run, and time out. We’d prefer not to change our query-code too much. How soon can Hive accommodate our simple queries?
      Yours hysterically,
      Project Zodiac
  • 37. Analysis: The query
    ▪ Non-const partition key predicates:
      › E.g. WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd') AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')
      › Solution: Use constant expressions where possible.
      › Fix: Hive 1.x supports dynamic partition pruning, and constant folding.
    ▪ Costly joins with partitioned dimension tables:
      › E.g. SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table WHERE dt IN (SELECT MAX(dt) FROM dimension_table));
      › Workaround: External “pointer” tables.
      › Fix: Dynamic partition pruning.
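One way to follow the "use constant expressions" advice is to compute the date bounds in the client and send literal strings to Hive. The sketch below assumes the query is assembled by a script; the function name `constant_date_predicate` and its parameters are hypothetical. The caller supplies `now` explicitly, whereas `from_unixtime` formats in the server's time zone, so the two may disagree near midnight.

```python
from datetime import datetime, timedelta


def constant_date_predicate(column, now, newest_days_ago=2, oldest_days_ago=32):
    """Build a partition-key range predicate from literal yyyyMMdd strings,
    so the planner sees constants instead of per-row function calls."""
    fmt = "%Y%m%d"
    upper = (now - timedelta(days=newest_days_ago)).strftime(fmt)
    lower = (now - timedelta(days=oldest_days_ago)).strftime(fmt)
    return f"{column} <= '{upper}' AND {column} >= '{lower}'"


# Fixed date used only to make the example reproducible.
print(constant_date_predicate("utc_time", datetime(2015, 6, 10)))
```

With constant bounds, partition pruning can happen at planning time even on Hive versions without constant folding.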
  • 38. Analysis: The data
    ▪ Data stored in TEXTFILE
      › Solution: Switch to columnar storage
        • ORC, dictionary encoding, vectorization, predicate pushdown
    ▪ Over-partitioning:
      › Too many partition keys
      › Diminishing returns with partition pruning
      › Solution: Eliminate partition keys, consider sorting
    ▪ Small Part files
      › Hard-coded nReducers
      › E.g. hive> dfs -count /projects/foo_stats;
        9081 682735 1876847648672 /projects/foo.db/foo_stats
      › Solution:
        • set hive.merge.mapfiles=true;
        • set hive.merge.mapredfiles=true;
        • set hive.merge.tezfiles=true;
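The `dfs -count` line on this slide reports directory count, file count, content size in bytes, and path, in that order. A short sketch makes the small-files problem concrete; the function name `summarize_count` is hypothetical, and the 128 MB block size is the `dfs.blocksize` value quoted on the deck's configuration slide.

```python
def summarize_count(line):
    """Parse one line of `dfs -count` output:
    DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME."""
    dirs, files, size, path = line.split(None, 3)
    dirs, files, size = int(dirs), int(files), int(size)
    return {
        "path": path,
        "files_per_dir": files / dirs,
        "avg_file_bytes": size // files,
    }


summary = summarize_count("9081 682735 1876847648672 /projects/foo.db/foo_stats")
block_size = 134217728  # 128 MB, the dfs.blocksize quoted elsewhere in the deck
print(summary["files_per_dir"], summary["avg_file_bytes"],
      summary["avg_file_bytes"] / block_size)
```

For the figures on the slide this works out to roughly 75 files per directory at under 3 MB each, a small fraction of one HDFS block, which is why merging output files helps.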
  • 39. We’re not done yet
    ▪ Tez/ATS scaling
    ▪ Speed up split calculation
    ▪ Auto/Offline compaction
    ▪ Abuse detection
    ▪ Better handling of schema evolution
    ▪ Skew Joins in Hive
    ▪ UDFs with JNI and configuring LD_LIBRARY_PATH
  • 42. YHive configuration settings:
    set hive.merge.mapfiles=false;        -- Except when producing data.
    set hive.merge.mapredfiles=false;     -- Except when producing data.
    set hive.merge.tezfiles=false;        -- Except when producing data.
    -- For ORC files.
    -- dfs.blocksize=134217728;           -- hdfs-site.xml
    set orc.stripe.size=67108864;         -- 64MB stripes.
    set orc.compress.size=262144;         -- 256KB compress buffer.
    set orc.compress=ZLIB;                -- Override to NONE, per table.
    set orc.create.index=true;            -- ORC indexes.
    set orc.optimize.index.filter=true;   -- Predicate pushdown with ORC index
    set orc.row.index.stride=10000;
  • 43. YHive configuration settings: (contd)
    -- Delegation Token Store settings:
    set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;
    set hive.cluster.delegation.token.renew-interval=172800000;
    -- (Start HCat Server with -Djute.maxbuffer=24MB -> 190K+ tokens.)
    -- DataNucleus settings:
    set datanucleus.connectionPoolingType=DBCP;   -- !(BoneCP).
    set datanucleus.cache.level1.type=none;
    set datanucleus.cache.level2.type=none;
    set datanucleus.connectionPool.maxWait=200000;
    set datanucleus.connectionPool.minIdle=0;
    -- Misc.
    set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;
  • 44. ZooKeeper Token Storage performance
    Jute buffer size (MB)    Max delegation token count
    4                        30K
    8                        60K
    12                       90K
    16                       130K
    20                       160K
    24                       190K
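The measurements on this slide can be turned into a simple sizing lookup. This is a sketch only: the function name `smallest_measured_buffer_mb` is hypothetical, the data points are copied from the slide, and nothing is extrapolated beyond the largest measured buffer.

```python
# Measurements copied from the slide: (jute buffer in MB, max delegation tokens).
MEASURED = [
    (4, 30_000),
    (8, 60_000),
    (12, 90_000),
    (16, 130_000),
    (20, 160_000),
    (24, 190_000),
]


def smallest_measured_buffer_mb(expected_tokens):
    """Return the smallest measured buffer size that held at least
    expected_tokens tokens, or None if the count exceeds every measurement."""
    for buffer_mb, max_tokens in MEASURED:
        if expected_tokens <= max_tokens:
            return buffer_mb
    return None  # beyond the measured range; no extrapolation attempted


print(smallest_measured_buffer_mb(100_000))
```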
  • 46. Why Hive on Tez?
    ▪ Shark, Impala
      › Pre-emption for in-memory systems
      › Multi-tenant, shared clusters
      › Heterogeneous nodes
      › Existing ecosystem
      › Community-driven development
    ▪ Shark
      › Good proof of concept, but was not production ready
      › Shuffle performance
      › Hive on Spark – under active development
  • 47. Analysis: Tableau/ODBC driver
    ▪ Tableau has come a long way, but
      › Schema discovery
        • SELECT * FROM my_large_table LIMIT 0;
        • SELECT DISTINCT part_key FROM my_large_table;
      › SQL dialect
        • Depends on vendor-specific driver-name
      › Schema metadata-scans
        • 3 partition listings per query
      › Miscellaneous problems:
        • “Custom SQL” rewrites
        • Trouble with quoting
    ▪ tl;dr : Try to transition to Simba’s 2.0.x Drivers with Tableau 8.3.x

Editor's Notes

  1. TODO: Update latest profile pic
  2. TODO: Update latest profile pic
  3. At last year’s talk, which was received so enthusiastically.
  4. Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
  5. Why 13? Why move from 12?
  6. 10000s of files? Spark, HBase
  7. Talk up the work from Gemini. Power-users of Tableau Server. People with RDBMS expertise think Partitions are analogous to Indexes. The more you have, the faster the query should run.
  8. Talk up the work from Gemini. Power-users of Tableau Server. People with RDBMS expertise think Partitions are analogous to Indexes. The more you have, the faster the query should run.
  9. Add diagram for deployment of Hive, and its evolution. Describe the problem with
  10. Add diagram for deployment of Hive, and its evolution.
  11. Last year saw a tonne of benchmarketing. Tez vs Spark (vs Impala). We’ve had several choices of execution engines. But we seem to have forgotten to scale a crucial part of the system. The metastore.
  12. Talk about the kinds of metadata: Input/Output formats, per table, per partition. Record format information. SerDe classes. Data paths. Table/Partition-level statistics. Also mention the hundreds of columns per table.
  13. Small split-size.
  14. My_function() is a webservice call. hive.log.incremental.plan.progress.
  15. This table is our largest. We use this to test and break our system.
  16. Focus on data-paths.
  17. Interesting segue: The “short” nPartitions parameter.
  18. Interesting segue: The “short” nPartitions parameter.
  19. Interesting segue: The “short” nPartitions parameter.
  20. Elaborate the problems with DataNucleus at scale: thread safety, memory usage, performance. Schema evolution can happen both at a geological pace, as well as a tectonic scale. Inheritance in Thrift is like implementing it in C. Mention that similar changes were made in Pig/HCatalog, for compressing Partition info. 26x storage saving (for split meta-info), + 10x faster for the query to start.
  21. The Java anecdote.
  22. Verbal Kint.
  23. Bonus: The rooftop scene in Sherlock 2.3.
  24. Charles Baudelaire. The Java anecdote.
  25. Introduce the beast that is Tableau. Flash the “simple” query.
  26. Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
  27. Talk about distcp -pgrub, for ORC files.
  28. At last year’s talk, which was received so enthusiastically.
  29. Shark was a good proof-of-concept, but was not production ready.
  30. Praise the work from Simba. Rework slide. Too much info. Just put the TLDR. SQL dialect: depends on vendor-specific driver-name. Schema metadata-scans: 3 partition listings per query. Miscellaneous problems: “Custom SQL” rewrites, trouble with quoting.