This document summarizes the evolution of Hive, a data warehouse infrastructure built on top of Hadoop. It traces Hive's origins at Facebook, where it was created to manage rapidly growing volumes of structured data. A key point is that Hive now functions as a parallel SQL DBMS that uses Hadoop for storage and execution. The document outlines new features in versions 0.6 and 0.7, such as views, dynamic partitioning, and pluggable indexing, and closes with Hive's roadmap for testing, performance improvements, and new capabilities.
2. Agenda
- Hive Overview
- Version 0.6 (released!)
- Version 0.7 (under development)
- Hive is now a TLP!
- Roadmaps

3. What is Hive?
- A Hadoop-based system for querying and managing structured data
- Uses Map/Reduce for execution
- Uses the Hadoop Distributed File System (HDFS) for storage

4. Hive Origins
- Data explosion at Facebook
- Traditional DBMS technology could not keep up with the growth
- Hadoop to the rescue!
- Incubation with ASF, then became a Hadoop sub-project
- Now a top-level ASF project
5. SQL vs MapReduce
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
6. Hive Evolution
- Originally: a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
- Now, more and more: a parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture

7. Intended Usage
- Web-scale Big Data: 100’s of terabytes
- Large Hadoop cluster: 100’s of nodes (heterogeneous OK)
- Data has a schema
- Batch jobs for both loads and queries

8. So Don’t Use Hive If…
- Your data is measured in GB
- You don’t want to impose a schema
- You need responses in seconds
- A “conventional” analytic DBMS can already do the job (and you can afford it)
- You don’t have a lot of time and smart people

9. Scaling Up
Facebook warehouse, Jan 2011:
- 2750 nodes
- 30 petabytes disk space
Data access per day:
- ~40 terabytes added (compressed)
- 25000 map/reduce jobs
- 300-400 users/month
10. Facebook Deployment
[Architecture diagram showing data flow among: Web Servers, Scribe MidTier, Scribe-Hadoop Clusters, Hive Replication, Production Hive-Hadoop Cluster, Archival Hive-Hadoop Cluster, Adhoc Hive-Hadoop Cluster, and Sharded MySQL]
13. Column Data Types
- Primitive types: integer types, float, string, boolean
- Nest-able collections: array<any-type>, map<primitive-type, any-type>
- User-defined types: structures with attributes, which can be of any-type

14. Hive Query Language
- DDL: {create/alter/drop} {table/view/partition}, create table as select
- DML: insert overwrite
- QL: sub-queries in from clause, equi-joins (including outer joins), multi-table insert, sampling, lateral views
- Interfaces: JDBC/ODBC/Thrift

15. Query Translation Example
SELECT url, count(*) FROM page_views GROUP BY url
- Map tasks compute partial counts for each URL in a hash table (“map side” pre-aggregation)
- Map outputs are partitioned by URL and shipped to corresponding reducers
- Reduce tasks tally up partial counts to produce final results
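The translation above can be sketched in plain Python, with a made-up list-of-URLs dataset: each "map task" pre-aggregates counts in a local hash table, partial counts are shuffled to a reducer chosen by hashing the URL, and the reducers tally the final result.

```python
from collections import Counter, defaultdict

def map_side_preaggregate(rows):
    """One map task: partial counts per URL ('map side' pre-aggregation)."""
    return Counter(rows)

def run_group_by(splits, num_reducers=2):
    # Shuffle: partition partial counts by hash(url) so each distinct
    # URL lands on exactly one reducer.
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for split in splits:
        for url, partial in map_side_preaggregate(split).items():
            reducers[hash(url) % num_reducers][url] += partial
    # Reduce: sum the partial counts into the final result.
    final = {}
    for r in reducers:
        final.update(r)
    return final

splits = [["a", "b", "a"], ["b", "a"]]   # two input splits, one per mapper
result = run_group_by(splits)
print(result)  # counts: a -> 3, b -> 2
```

Pre-aggregation matters because each mapper emits at most one record per distinct URL instead of one per input row, shrinking the shuffle.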
16.
FROM (
  SELECT a.status, b.school, b.gender
  FROM status_updates a
  JOIN profiles b ON (a.userid = b.userid AND a.ds = '2009-03-20')
) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school
18. Behavior Extensibility
- TRANSFORM scripts (any language): serialization + IPC overhead
- User-defined functions (Java): in-process, lazy object evaluation
- Pre/post hooks (Java): statement validation/execution; example uses include auditing, replication, authorization, multiple clusters
19. Map/Reduce Scripts Example
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM (
  SELECT TRANSFORM(user_id, page_url, unix_time)
  USING 'page_url_to_id.py' AS (user_id, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY user_id
  SORT BY user_id, unix_time
) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py' AS (user_id, session_info);
20. UDF vs UDAF vs UDTF
- User Defined Function: one-to-one row mapping, e.g. concat('foo', 'bar')
- User Defined Aggregate Function: many-to-one row mapping, e.g. sum(num_ads)
- User Defined Table Function: one-to-many row mapping, e.g. explode([1,2,3])
21. UDF Example
add jar build/ql/test/test-udfs.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
SELECT testlength(src.value) FROM src;
DROP TEMPORARY FUNCTION testlength;

UDFTestLength.java:
package org.apache.hadoop.hive.ql.udf;
public class UDFTestLength extends UDF {
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}
22. Storage Extensibility
- Input/OutputFormat (file formats): SequenceFile, RCFile, TextFile, …
- SerDe (row formats): Thrift, JSON, ProtocolBuffer, …
- Storage handlers (new in 0.6): integrate foreign metadata, e.g. HBase
- Indexing: under development in 0.7

23. Release 0.6 (October 2010)
- Views
- Multiple databases
- Dynamic partitioning
- Automatic merge
- New join strategies
- Storage handlers
24. Dynamic Partitions
Automatically create partitions based on distinct values in columns:
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
       null, null, pvs.ip, pvs.country
FROM page_view_stg pvs
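The effect of dynamic partitioning can be sketched in Python (rows and column names below are hypothetical): each row is routed into a partition derived from its own country value, while the dt partition key stays static.

```python
from collections import defaultdict

def dynamic_partition(rows, static=("dt", "2008-06-08"), dyn_col="country"):
    """Group rows by a static partition key plus a per-row dynamic value,
    mimicking INSERT ... PARTITION(dt='2008-06-08', country) SELECT ..."""
    partitions = defaultdict(list)
    for row in rows:
        # One partition per distinct (static, dynamic) key combination.
        key = (static, (dyn_col, row[dyn_col]))
        partitions[key].append(row)
    return partitions

rows = [{"userid": 1, "country": "US"},
        {"userid": 2, "country": "IN"},
        {"userid": 3, "country": "US"}]
parts = dynamic_partition(rows)
for key, rs in sorted(parts.items()):
    print(key, len(rs))
```

The win is that the partition list never has to be spelled out: Hive discovers it from the data, one output directory per distinct value.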
25. Automatic Merge
- Jobs can produce many files
- Why is this bad? Namenode pressure; downstream jobs have to deal with file-processing overhead
- So, clean up by merging results into a few large files (configurable)
- Use a conditional map-only task to do this

26. Join Strategies
- Old join strategies: map-reduce join and map join
- Bucketed map join: allows the “small” table to be much bigger
- Sort-merge map join
- Deal with skew in map/reduce join: conditional plan step for skewed keys
27. Storage Handler Syntax (HBase example)
CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
28. Release 0.7
Deployed in Facebook:
- Stats functions
- Indexes
- Local mode
- Automatic map join
- Multiple DISTINCTs
- Archiving
In development:
- Concurrency control
- Stats collection
- J/ODBC enhancements
- Authorization
- RCFile2
- Partitioned views
- Security enhancements
29. Statistical Functions
- Stats 101: stddev, var, covar, percentile_approx
- Data mining: ngrams, sentences (text analysis), histogram_numeric
SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
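histogram_numeric builds an approximate histogram of a numeric column in a single pass. A toy Python sketch of the idea, using fixed-width bins rather than Hive's adaptive binning (the sample years are made up):

```python
def histogram_numeric(values, nbins):
    """Toy fixed-width histogram returning (bin_center, count) pairs.
    Hive's real histogram_numeric adapts bin boundaries to the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1        # avoid zero width if all equal
    counts = [0] * nbins
    for v in values:
        # Clamp the top edge into the last bin.
        counts[min(int((v - lo) / width), nbins - 1)] += 1
    return [(lo + (i + 0.5) * width, c) for i, c in enumerate(counts)]

hist = histogram_numeric([1980, 1985, 1990, 1991, 2000], 2)
print(hist)  # [(1985.0, 2), (1995.0, 3)]
```

Like the sketch, the real UDAF returns (center, height) pairs, which makes it easy to plot a distribution per GROUP BY key.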
36. Local Mode Execution
- Avoids map/reduce cluster job latency
- Good for jobs which process small amounts of data
- Let Hive decide when to use it: set hive.exec.mode.local.auto=true;
- Or force its usage: set mapred.job.tracker=local;
37. Automatic Map Join
- Map join if the small table fits in memory; if it can’t, fall back to reduce join
- Optimize hash table data structures
- Use the distributed cache to push out a pre-filtered lookup table
- Avoid swamping HDFS with reads from thousands of mappers
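A map join is essentially a broadcast hash join. A minimal Python sketch, with made-up users/views tables: the small side is loaded into an in-memory hash table once, then the big side streams through it inside each mapper, with no shuffle phase at all.

```python
def map_join(big_rows, small_rows, key):
    """Map-side (broadcast) hash join: hash the small table, stream the big one."""
    lookup = {}
    for r in small_rows:                  # small table fits in memory by assumption
        lookup.setdefault(r[key], []).append(r)
    out = []
    for b in big_rows:                    # one pass over the big table
        for s in lookup.get(b[key], []):  # inner join: drop unmatched rows
            out.append({**b, **s})
    return out

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
views = [{"uid": 1, "url": "/a"}, {"uid": 1, "url": "/b"}, {"uid": 3, "url": "/c"}]
joined = map_join(views, users, "uid")
print(joined)  # two rows, both joined to user "ann"
```

The automatic part in 0.7 is the fallback: Hive tries this plan first and reverts to a reduce-side join if the hash table would not fit.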
38. Multiple DISTINCT Aggs Example
SELECT view_date, COUNT(DISTINCT userid), COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date

39. Archiving
- Use HAR (Hadoop archive format) to combine many files into a few
- Relieves namenode memory
ALTER TABLE page_views {ARCHIVE|UNARCHIVE} PARTITION (ds='2010-10-30')
40. Concurrency Control
- Pluggable distributed lock manager; default is ZooKeeper-based
- Simple read/write locking: table-level and partition-level
- Implicit locking (statement level), deadlock-free via lock ordering
- Explicit LOCK TABLE (global)
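"Deadlock-free via lock ordering" can be sketched in Python (lock names below are hypothetical): every statement acquires its locks in one global sort order, so two concurrent statements can never hold-and-wait on each other in opposite orders, which rules out cycles.

```python
import threading

# One lock object per table/partition name (a stand-in for ZooKeeper nodes).
locks = {name: threading.Lock() for name in ["db.t1", "db.t2", "db.t1/ds=1"]}

def acquire_all(names):
    """Acquire locks in a single global (sorted) order; deadlock-free
    because every caller climbs the same ordering."""
    ordered = sorted(set(names))
    for n in ordered:
        locks[n].acquire()
    return ordered

def release_all(names):
    for n in reversed(names):
        locks[n].release()

held = acquire_all(["db.t2", "db.t1"])   # always grabs db.t1 first
print(held)  # ['db.t1', 'db.t2']
release_all(held)
```

A statement touching a partition also needs the enclosing table's lock, which fits the same scheme since names sort hierarchically.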
41. Statistics Collection
- Implicit metastore update during load, or explicit via ANALYZE TABLE
- Table/partition-level: number of rows, number of files, size in bytes

42. Hive is now a TLP
PMC: Namit Jain (chair), John Sichi, Zheng Shao, Edward Capriolo, Raghotham Murthy
Committers: Amareshwari Sriramadasu, Carl Steinbach, Paul Yang, He Yongqiang, Prasad Chakka, Joydeep Sen Sarma, Ashish Thusoo, Ning Zhang

43. Developer Diversity
- Recent contributors: Facebook, Yahoo, Cloudera, Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems
- Numerous research projects; many, many more…
- Monthly San Francisco bay area contributor meetups
- India meetups?
44. Roadmap: Heavy-Duty Tests
- Unit tests are insufficient
- What is needed: real-world schemas/queries; non-toy data scales; scripted setup and a configuration matrix; correctness/performance verification
- Automatic reports: throughput, latency, profiles, coverage, perf counters…

45. Roadmap: Shared Test Site
- Nightly runs, regression alerting
- Performance trending
- Synthetic workload (e.g. TPC-H)
- Real-world workload (anonymized?)
- This is critical for non-subjective commit criteria and release quality

46. Roadmap: New Features
- Hive Server: stability/deployment
- File concatenation: reduce the number of files
- Performance: Bloom filters, push-down filters
- Cost-based optimizer: column-level statistics; plans should be based on statistics
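The Bloom filters on the performance roadmap can be sketched in a few lines of Python: k hash probes set bits in an m-bit array, giving a membership test that may false-positive but never false-negative, so a join or filter can cheaply skip rows that definitely have no match. Sizes and keys here are made up.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _probes(self, item):
        # Derive k independent positions by salting a single hash.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # All k bits set => "probably present"; any bit clear => definitely absent.
        return all(self.bits[p] for p in self._probes(item))

bf = BloomFilter()
bf.add("key-42")
print("key-42" in bf)    # True: added keys are always found
print("key-999" in bf)   # almost certainly False at this fill level
```

In a join this would be built from the small side's keys and shipped to mappers, so big-side rows with no possible match never reach the shuffle.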