An Introduction to Hive
  • Hive was built on top of Hadoop so that Big Data could be queried. Hive was created at Facebook. The problem Facebook faced later became the problem of many other companies: the performance and capabilities of RDBMSs and NoSQL stores gradually faded on large data sets. Reports began to take minutes and sometimes hours. Sometimes two reports running at the same time caused a major problem. Systems slowed down, hung, or became unavailable. Even after that problem was solved, a second need became visible: getting at the data without dealing with MR directly. People had to be able to retrieve and use the data without mastering the complexities of MapReduce. Hadoop had no schema and was hard to work with.
      – MapReduce code is not reusable.
      – Complex jobs need multiple stages of Map/Reduce functions.
      – Example: the problem Tehran Province Telecommunications Company had in reporting its outage list or changes in its database.
      – Example: the 36-hour query and the 24-second query.
      – Example: Tavanir.
  • What is Hive? It is free and open source. Free and open source are not the same thing; Hive is both. It is a data warehouse for Hadoop. It is an abstraction layer over Hadoop.
  • What is interesting about Hive is that it lets us use Hadoop and big data facilities without knowing MapReduce. We get scalability while working through a query language interface that resembles classic SQL. Hive was open-sourced by Facebook in 2008 and released under the Apache license.
  • OLAP: online analytical processing
  • OLTP: online transaction processing
  • Components:
      – Hadoop: Hive needs Hadoop as its base framework.
      – Driver: Hive has its own driver to communicate with the Hadoop world.
      – CLI: the Hive CLI is the console for issuing Hive queries and operating on the data.
      – Web interface: Hive also provides a web interface to monitor and administer Hive jobs.
      – Metastore: the metastore is Hive's database catalog; it stores the structure information of all tables and partitions in Hive.
      – Thrift Server: Hive can be exposed as a service, which clients can then connect to via JDBC, ODBC, and so on.
  • UDF: user-defined function.
  • Directed acyclic graph (DAG): a directed graph with no directed cycles.
  • Partition: each table can have one or more partition keys. Data is stored in files according to the partition key. Without partitions the whole data set is sent to MR; with partitions, the data sent to MR can be restricted.
  • Bucket: the data in each partition is further grouped by hash value. The buckets are kept inside the partition's own directory.
  • SerDe: for working with complex data and with complex, multi-character delimiters. Use case: log processing.
  • DISTRIBUTE BY + SORT BY (on the same columns) = CLUSTER BY. Similar to GROUP BY.
  • These tools are like log4j, except that they also pre-process and post-process the logs.
  • Drill: the design goal is to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.

Presentation Transcript

  • An Introduction to Apache HIVE
      • By: Reza Ameri
      • Semester: Fall 2013
      • Course: DDB
      • Prof: Dr. Naderi
  • Agenda
      • Starting Note
          – What is Hive
          – What is cool about Hive
          – Hive in use
          – What Hive is not
      • Brief About Data Warehouse
    An Introduction to Apache HIVE 2 of 31
  • Agenda (contd.)
      • Hive Architecture
          – Components
          – Architecture Diagram
      • Hive in Production
          – HQL
          – Data Insertion/Aggregation
      • Performance
      • Further Reading
      • References
  • Starting Note
      • What is Apache Hive?
          – Open source (very important!), so it is free
          – A data warehouse system on Hadoop
          – Provides HQL, a SQL-like query interface
          – Suitable for structured and semi-structured data
          – Can work with different storage systems and file formats
  • Starting Note (contd.)
      • What is cool about Hive
          – The HiveQL interface lets users run MapReduce jobs without thinking in MapReduce.
      • Some history
          – Hive was created at Facebook.
          – Netflix also contributes to its development.
          – Amazon uses it in Amazon Elastic MapReduce.
  • Starting Note (contd.)
      • What Hive is not
          – It does not use complex indexes, so it does not respond within seconds.
          – It does scale very well, and it works with data on the order of petabytes.
          – It is not independent: its performance is tied to Hadoop.
  • Brief About Data Warehouse
      • OLAP vs OLTP
          – A data warehouse (DW) is needed for OLAP.
          – We want reports and summaries, not the live transaction data that keeps operations running.
          – We need reports to improve operations, not to carry out an operation.
          – We use ETL to populate the DW.
  • Brief About Data Warehouse
      • Inmon approach vs Kimball approach (two diagram slides; the figures are not part of the transcript)
  • Brief About Data Warehouse
      • Other keywords
          – ODS (Operational Data Store)
          – Fact tables
          – Data mart
          – Dimensions
          – Concurrent ETLs
  • Hive Architecture
      • Components
          – Hadoop
          – Driver
          – Command Line Interface (CLI)
          – Web Interface
          – Metastore
          – Thrift Server
  • Hive Architecture (diagram slides; the figure is not part of the transcript). Labels recovered from the diagram:
      • Interfaces: Web UI, Hive CLI, JDBC/ODBC (Browse, Query, DDL)
      • Hive QL: Parser, Planner, Optimizer, Execution
      • MetaStore and Thrift API
      • SerDe: CSV, Thrift, Regex
      • UDF/UDAF: substr, sum, average
      • FileFormats: TextFile, SequenceFile, RCFile
      • Map Reduce, HDFS, user-defined map-reduce scripts
  • Hive Architecture (contd.)
      • Internal components
          – Compiler and Planner: compiles and checks the input query and creates an execution plan.
          – Optimizer: optimizes the execution plan before it runs.
          – Execution Engine: runs the execution plan. The execution plan is guaranteed to be a DAG.
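Why the DAG guarantee matters can be sketched in a few lines of Python: a plan without cycles always has at least one valid run order in which every stage starts after the stages it depends on. This is only an illustration, not Hive's execution engine; the stage names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stages of a compiled query plan. Each key maps to the set of
# stages that must finish before it can start. The names are illustrative.
plan = {
    "scan_movies": set(),
    "scan_ratings": set(),
    "join": {"scan_movies", "scan_ratings"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def execution_order(stages):
    """Return one valid run order; raises CycleError if the plan has a cycle."""
    return list(TopologicalSorter(stages).static_order())


order = execution_order(plan)
print(order)
```

A plan that contained a cycle (for example, two stages that each wait for the other) would raise `CycleError`, which is why an engine can only schedule plans that are acyclic.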
  • Hive Architecture (contd.)
      • Hive Data Model: all data in Hive is organized into
          – Databases: the first level of abstraction.
          – Tables: ordinary tables.
          – Partitions: limit the data that has to be transferred to MapReduce.
          – Buckets: make data access within a partition easier.
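The partition and bucket levels of the data model can be mimicked with a small in-memory layout. The sketch below is a simplification under stated assumptions: the sample rows, the bucket count, and the helper names are hypothetical, and CRC32 merely stands in for whatever hash function Hive applies.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, chosen only for the example


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket. CRC32 is a stand-in, not Hive's hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def store(rows):
    """Group rows as partition value -> bucket number -> list of rows."""
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        layout[row["date"]][bucket_of(row["name"])].append(row)
    return layout


def scan_partition(layout, date):
    """Read a single partition; rows of other partitions are never touched."""
    return [row for bucket in layout.get(date, {}).values() for row in bucket]


rows = [
    {"id": 1, "date": "2013-10-01", "name": "alice"},
    {"id": 2, "date": "2013-10-01", "name": "bob"},
    {"id": 3, "date": "2013-10-02", "name": "alice"},
]
layout = store(rows)
print(scan_partition(layout, "2013-10-01"))
```

A query that filters on the partition key only has to read one entry of `layout`, which is the effect the slide describes as limiting the data sent to MapReduce.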
  • Hive in Production
      • Log processing
          – Daily reports
          – User activity measurement
      • Data/text mining
          – Machine learning (training data)
      • Business intelligence
          – Advertising delivery
          – Spam detection
  • Hive in Production
      • HQL
          – Create
          – Row Format
          – SerDe
          – Select
          – Cluster By/Distribute By
      • Data Insertion/Aggregation
  • HQL Samples
      • CREATE TABLE
            CREATE TABLE movies (movie_id INT, movie_name STRING, tags STRING)
      • ROW FORMAT (continues the statement above)
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
  • HQL Samples
      • Partition (the partition column is declared only in PARTITIONED BY, not repeated in the column list)
            CREATE TABLE table_name (id INT, name STRING)
            PARTITIONED BY (date STRING);
  • HQL Samples
      • SerDe
          – A user table whose lines have the format "id::gender::age::occupation::zipcode".
          – The contrib RegexSerDe accepts only STRING columns, so every column is declared STRING; cast in queries where a number is needed.
            CREATE TABLE USER (id STRING, gender STRING, age STRING, occupation STRING, zipcode STRING)
            ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
            WITH SERDEPROPERTIES ("input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
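To see what the `input.regex` pattern from the SerDe sample does to one line, the same expression can be tried in Python. This only illustrates the regular expression, not Hive's RegexSerDe itself; the sample line and the `parse_user` helper are hypothetical.

```python
import re

# Same pattern as the "input.regex" SerDe property: five capture groups
# separated by the two-character delimiter "::".
INPUT_REGEX = re.compile(r"(.*)::(.*)::(.*)::(.*)::(.*)")
COLUMNS = ("id", "gender", "age", "occupation", "zipcode")


def parse_user(line):
    """Split one line into named columns; None if the line does not match."""
    match = INPUT_REGEX.fullmatch(line)
    if match is None:
        return None
    # Every captured value is a string; no type conversion happens here.
    return dict(zip(COLUMNS, match.groups()))


print(parse_user("1::F::25::10::48067"))  # hypothetical sample line
```

Each capture group lines up with one table column in declaration order, which is how a multi-character delimiter such as `::` can be handled where a single-character `FIELDS TERMINATED BY` cannot.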
  • HQL Samples
      • Select
            SELECT * FROM movies LIMIT 10;
      • Distribute By
            SELECT * FROM movies DISTRIBUTE BY tags;
          – DISTRIBUTE BY names the column used to decide which reducer each row is sent to.
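The effect of DISTRIBUTE BY, and of CLUSTER BY as DISTRIBUTE BY plus SORT BY on the same column, can be imitated with a small Python simulation. It is a sketch under assumptions: the sample rows, the reducer count, and the function names are hypothetical, and CRC32 stands in for the hash Hive actually uses.

```python
import zlib
from collections import defaultdict


def distribute_by(rows, column, num_reducers):
    """Route each row to a reducer chosen by hashing the given column."""
    reducers = defaultdict(list)
    for row in rows:
        key = str(row[column]).encode("utf-8")
        reducers[zlib.crc32(key) % num_reducers].append(row)
    return reducers


def cluster_by(rows, column, num_reducers):
    """DISTRIBUTE BY column, then sort each reducer's rows by that column."""
    reducers = distribute_by(rows, column, num_reducers)
    return {
        reducer_id: sorted(part, key=lambda r: r[column])
        for reducer_id, part in reducers.items()
    }


movies = [
    {"movie_id": 1, "movie_name": "A", "tags": "drama"},
    {"movie_id": 2, "movie_name": "B", "tags": "comedy"},
    {"movie_id": 3, "movie_name": "C", "tags": "drama"},
    {"movie_id": 4, "movie_name": "D", "tags": "action"},
]
for reducer_id, part in sorted(cluster_by(movies, "tags", 2).items()):
    print(reducer_id, [m["tags"] for m in part])
```

Rows with equal `tags` values always land on the same reducer, but one reducer may receive several different tags; the ordering is per reducer only, not global.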
  • Hive Process
      • Data Insertion/Aggregation
          – Bulk (ETL tools)
              · Talend (community version)
              · Sqoop (SQL to Hadoop, Apache license)
              · SyncSort (not free)
  • Hive Process (contd.)
          – STP (Straight Through Processing)
              · Flume: Apache licensed
              · Chukwa: part of the Apache Hadoop distribution
              · Scribe: Facebook's solution for log processing and aggregation
  • Hive Process (contd.)
      • Netflix case study
          – Uses Chukwa
          – Log processing
          – Count errors per session
          – Count streams per day
          – Ad hoc queries such as summaries (sum, max, min, …)
  • Hive Process (contd.) (diagram slide; the figure is not part of the transcript)
  • Hive Process (contd.)
      • Phase 1
          – A Hadoop job parses the logs and loads them into Hive every hour.
          – The same job also runs every 24 hours to produce summaries.
      • Phase 2
          – Real-time log processing (parse/merge/load)
          – Chukwa collects logs continuously.
  • Performance
      • According to Globant's investigations
      • Tables: (not part of the transcript)
  • Performance (two chart slides; the figures are not part of the transcript)
  • Further Reading
      • Apache Drill: a software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
      • Pig: a platform for creating and running MapReduce programs on Hadoop
      • Oracle Big Data
      • DB2 10 and InfoSphere Warehouse
      • Parallel databases: Gamma, Bubba, Volcano
      • Google: Sawzall
      • Yahoo: Pig
      • IBM: JAQL
      • Microsoft: DryadLINQ, SCOPE
  • References
      • https://www.facebook.com/note.php?note_id=89508453919
      • https://github.com/facebook/scribe
      • http://sqoop.apache.org/docs/
      • http://flume.apache.org/FlumeDeveloperGuide.html
      • Sqoop Database Import For Hadoop, Cloudera, Oct. 2009
      • https://cwiki.apache.org/confluence/display/Hive/LanguageManual
      • http://www.semantikoz.com/blog/the-free-apache-hive-book/
      • Beginning Microsoft SQL Server 2012 Programming, Paul Atkinson and Robert Vieira, Wiley, ISBN: 978-1-118-10228-2
      • Hive – A Petabyte Scale Data Warehouse Using Hadoop, Facebook team, 2009
  • Thanks…