Hadoop & Hive Change the Data Warehousing Game Forever

  • Batman *Forever*
  • Our ability to capture data has far exceeded our ability to analyze it.
    Traditional data warehousing tools have not kept pace with the growth of data.
    Hadoop allows us to capture and store data economically, but traditional BI tools and approaches don’t work on it.


    IDC

    “Currently a quarter of the information in the Digital Universe would be useful for big data if it were tagged and analyzed. We think only 3% of the potentially useful data is tagged, and even less is analyzed”
  • Sad panda
  • Happy panda!
  • Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
  • The Klout architecture is made up of open source tools.

    Since we could not afford an expensive software/hardware solution, we chose Hadoop and Hive. We built a 150-node Hadoop cluster out of $3k commodity nodes to create a 1.5 petabyte warehouse.

    We connected SQL Server Analysis Services directly to our Hive data warehouse to provide an interactive query environment.
  • Google Analytics gave us great page-by-page analysis and great reports, but we couldn’t send it user-identifiable data.
  • Mixpanel has great support for real-time events, but we couldn’t send it all the data needed to draw really interesting conclusions. Joining on our own data was still going to be a huge challenge.
  • We had all our data, but of course, that was about it
  • We couldn’t cross the streams. We wanted to discover really interesting patterns and make advanced recommendations based on who the user was.
  • At Klout, we used web analytics tools like Google Analytics and Mixpanel to understand how our users interacted with our web site and mobile app. However, we could not join the usage data with our profile data, which made for an incomplete view of our users.

    We decided to build a flexible, event-oriented architecture to capture all user-activity events. This is the architecture.
  • First, we invented a simple, JSON-oriented event capture method. This allowed our web and app designers to add instrumentation without regard to how it would affect the downstream analytics applications or the Hive warehouse. An example event is shown below.
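
    For reference, a raw event is just a small JSON document. This is the example from slide 32 (project, event, user id, plus an event-specific attribute):

    {
      "project": "plusK",
      "event": "spend",
      "ks_uid": 123456,
      "type": "add_topic"
    }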
  • Next, using Flume, we mapped the semi-structured data stream into time-partitioned files in Hadoop HDFS.
  • We then created an EXTERNAL Hive table on top of this file structure. That allowed us to “query” the incoming files in HDFS; a sketch of the DDL follows below.
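
    A minimal sketch of what that DDL could look like, modeled on the EVENT_LOG layout from slide 34 (the table name, HDFS path, and partition values here are illustrative, not the actual deployment):

    -- External table over the Flume output directories; Hive stores no data itself.
    CREATE EXTERNAL TABLE event_log (
      tstamp     INT,
      project    STRING,
      event      STRING,
      session_id BIGINT,
      ks_uid     BIGINT,
      ip         STRING,
      attr_map   MAP<STRING,STRING>,
      json_text  STRING
    )
    PARTITIONED BY (dt STRING, hr STRING)   -- the time partitions Flume writes into
    LOCATION '/data/event_logs';            -- illustrative HDFS path

    -- Each new hour of files becomes queryable once its partition is registered:
    ALTER TABLE event_log ADD IF NOT EXISTS PARTITION (dt='20121027', hr='23');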
  • In order to provide an interactive query environment (OLAP), we connected SQL Server Analysis Services directly to the Hive warehouse and continuously updated a MOLAP cube with the data.
  • We could then hook up internally developed applications (Event Tracker) to our data by having them generate MDX (the multidimensional query language) and run those queries against our cube.
  • Or we could use the Hive CLI (command line interface) to execute SQL queries directly against our Hive warehouse; a pared-down example follows below.
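
    Slide 36 shows a real example; here is a pared-down version, assuming the event_log sketch above (the JSON field names are taken from that slide):

    -- Run from the Hive CLI: pull fields out of the raw JSON payload at query time.
    SELECT get_json_object(json_text, '$.kloutId') AS kloutId,
           get_json_object(json_text, '$.status')  AS status,
           event
    FROM   event_log
    WHERE  dt = '20121027'
      AND  event IN ('api_error', 'api_timeout');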
  • Thumbs up!
  • By leveraging Hive’s non-scalar data types (map, array, struct, union), we can store data in an event/attributes (key/value pairs) format and perform transformations at query time (schema on read). This keeps our data modeling (pre-structuring) to a minimum and lets us add new data without affecting the schema. We capture all the data we want in the simplest possible form (log files) and structure it later, on read. This drastically simplifies data modeling, creates huge flexibility, and reduces cost. A query sketch follows below.
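
    A minimal sketch of schema on read over the attributes map, assuming the event_log table sketched earlier (the 'type' key comes from the slide 32 event; the partition value is illustrative):

    -- Structure is applied at query time: pull a key out of the map and aggregate on it.
    SELECT event,
           attr_map['type'] AS action_type,
           COUNT(*)         AS event_count
    FROM   event_log
    WHERE  dt = '20121027'
    GROUP BY event, attr_map['type'];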
  • Apache Hive’s reliance on MapReduce as its core data processing engine makes it unsuitable for interactive queries, due to startup times and MapReduce’s batch nature.

    Several approaches are emerging to address these deficiencies while remaining Hive-catalog compatible. These developments are what make Hadoop/Hive work as the world’s least expensive, yet scalable, data warehousing platform.
  • Aron MacDonald
    Source: http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws

    Cloudera Impala, which is essentially free, performs almost as well as an expensive alternative that relies on in-memory caching to deliver its performance.
  • Shark, Impala, etc. turn Hive into a real interactive SQL query environment. This is a huge advancement and the missing piece that turns Hadoop into the world’s cheapest, most scalable database.

    Here’s a query that demonstrates Shark/Hive’s support for non-scalar data types:

    use aw_demo;

    describe factinternetsales;

    select a.year, s.stylename, a.num_orders
    from (
      select part_year as year,
             product_info["style"] as style,
             sum(orderquantity) as num_orders
      from factinternetsales
      where part_year < 2007
      group by product_info["style"], part_year
    ) a
    left outer join dimstyle s on a.style = s.stylekey
    order by year, num_orders desc;

  • Too Long; Didn’t Watch
  • Hadoop & Hive Change the Data Warehousing Game Forever

     1. 1. How Hadoop & Hive Change the Data Warehousing Game Forever. Dave Mariani, CEO & Founder of AtScale, @dmariani, Atscale.com. 2014 Hadoop Summit, San Jose, CA, June 3, 2014
    2. 2. The Truth about Data
     3. 3. “We think only 3% of the potentially useful data is tagged, and even less is analyzed.” Source: IDC Predictions 2013: Big Data, IDC. “90% of the data in the world today has been created in the last two years.” Source: IBM
     4. 4. In 2012, 2.5 quintillion bytes of data were generated every day. Source: IBM
    5. 5. 2,500,000,000 Gb
     6. 6. …and that was back in 2012
    7. 7. The Broken Promise
    8. 8. What we wanted
    9. 9. What we got
    10. 10. Time for a new approach
    11. 11. Relational DBs Volume: Write twice Variety: Structured Velocity: Early Transformation
    12. 12. Hadoop Volume: Write once Variety: Semi-structured Velocity: Late Transformation
    13. 13. INPUT DATA HADOOP ETL MART MART MART QUERY ENGINE VISUALIZER
    14. 14. INPUT DATA HADOOP ETL MART MART MART QUERY ENGINE VISUALIZER
    15. 15. INPUT DATA HADOOP (HIVE) VISUALIZER
    16. 16. Case Study Klout
    17. 17. 20
    18. 18. 15 Social networks processed daily
    19. 19. 769 TB of data storage
    20. 20. 200,000 Indexed users added daily
    21. 21. 400,000,000 Users indexed daily
    22. 22. 12,000,000,000 Social signals processed daily
    23. 23. 50,000,000,000 API calls delivered monthly
    24. 24. 10,080,000,000,000 Rows of data in the data warehouse
    25. 25. Trillions!
    26. 26. Klout’s Data Architecture
    27. 27. A common question How are users using our site?
    28. 28. Google Analytics + Great page by page analysis - User identifiable data against EULA
     29. 29. MixPanel + Great event tracking - Can send user-specific identifiers, but big limitations
    30. 30. Klout + All our data telling us who these people actually are - That’s about it
    31. 31. Klout’s Event Tracker
    32. 32. { "project": "plusK", "event": "spend", "ks_uid": 123456, "type": "add_topic" }
    33. 33. { "project": "plusK", "event": "spend", "session_id": "0", "ip": "50.68.47.158", "kloutId": "123456", "cookie_id": "123456", "ref": "http://www.klout.com", "type": "add_topic", "time": "1338366015" }
    34. 34. EVENT_LOG tstamp INT project STRING event STRING session_id BIGINT ks_uid BIGINT ip STRING attr_map MAP<STRING,STRING> json_text STRING dt STRING hr STRING
    35. 35. SELECT { [Measures].[Counter], [Measures].[PreviousPeriodCounter]} ON COLUMNS, NON EMPTY CROSSJOIN ( exists([Date].[Date].[Date].allmembers, [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-0 02T00:00:00]), [Events].[Event].[Event].allmembers) DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS FROM [ProductInsight] WHERE ({[Projects].[Project].[plusK]})
    36. 36. SELECT get_json_object(json_text,'$.sid') as sid, get_json_object(json_text,'$.kloutId') as kloutId, get_json_object(json_text,'$.v') as version, get_json_object(json_text,'$.status') as status, event FROM bi.event_log WHERE project='mobile-ios' AND tstamp=20121027 AND event in ('api_error', 'api_timeout') ORDER BY sid;
    37. 37. So, what’s wrong with this picture?
    38. 38. Klout’s Data Architecture
    39. 39. Klout’s Data Architecture
    40. 40. Case Study Online Gaming (MMO)
     41. 41. Capture: ... Login\t1369155542\t4533245\t"loc":"23","rank":"Expert","client":"ios"\n Buy\t1369155556\t4533446\t"loc":"23","item":"212","ref":"ask.com","amt":"1.50"\n ... (fields: Event Name, Timestamp, User ID, Attributes)
     42. 42. CREATE EXTERNAL TABLE event_log ( event STRING, event_time TIMESTAMP, user_id INT, event_attributes MAP<STRING, STRING> ) PARTITIONED BY (event_day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' LOCATION '/user/event_logs'; (fields: Event Name, Timestamp, User ID, Attributes Map)
     43. 43. SELECT SUBSTR(FROM_UNIXTIME(event_time),1,7) AS MonthOfEvent, event_attributes['loc'] AS Location, count(*) AS EventCount FROM event_log WHERE year(FROM_UNIXTIME(event_time)) = 2014 GROUP BY SUBSTR(FROM_UNIXTIME(event_time),1,7), event_attributes['loc']; Transform and Query (fields: Event Name, Timestamp, User ID, Attributes)
    44. 44. Hive for analytics Why now?
    45. 45. New, Interactive Flavors
    46. 46. Shark Impala Stinger
     47. 47. Shark vs. Impala vs. Stinger
         Performance approach: Shark = Caching, Impala = Optimizer, Stinger = Improve Hive
         Theoretical limits (# of rows): Shark = Billions, Impala = Trillions, Stinger = Trillions
         Supports UDFs, SerDes: Shark = Yes, Impala = Fall ’14, Stinger = Yes
         Supports non-scalar data types: Shark = Yes, Impala = Fall ’14, Stinger = Yes
         Preferred file format: Shark = Tachyon, Impala = Parquet, Stinger = ORC
         Sponsorship: Shark = Databricks, Impala = Cloudera, Stinger = Hortonworks
    48. 48. Hive is a cheap MPP database
     49. 49. TPC-H Query Run Times, Impala vs. HANA (lineitem table, 60 million rows). Per query: records returned, then time in seconds for HANA Small, Impala Small (1 Node) Parquet, Impala Small (3 Nodes) Parquet, Impala Small (1 Node) Text, Impala Small (3 Nodes) Text.
         select count(*) from lineitem: 1 | 1 | 3 | 1 | 74 | 31
         select count(*), sum(l_extendedprice) from lineitem: 1 | 4 | 12 | 3 | 73 | 29
         select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode: 7 | 8 | 23 | 5 | 74 | 28
         select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode: 1 | 1 | 20 | 4 | 73 | 28
         select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus: 14 | 10 | 32 | 7 | 74 | 28
         select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus: 1 | 1 | 27 | 5 | 72 | 29
         select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1: 45 | 1 | 23 | 5 | 73 | 30
         select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1: 45 | 1 | 29 | 5 | 73 | 31
         select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1: 45 | 1 | 104 | 21 | 73 | 30
         Size: HANA 1.9Gb (5 Part.), Impala Parquet 3.2Gb (40 files x 80mb), Impala Text 7.2Gb (1 file, no compression)
         Est. monthly cost of production environment on AWS (HANA m2.xlarge, Impala m1.medium): $1022 | $175 | $350 | $175 | $350
         Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
    50. 50. Real Customer Scenario: Impala (CDH5) 5 Data Node Cluster, 16Gb of RAM each, 4 Cores each, Parquet Fact Table (# of rows) Dimensions Execution Time (seconds) 1 Dimension (28 rows) 0:00:00.836 2 Dimensions (28 rows, 28 rows) 0:00:00.767 2 Dimensions (28 rows, 2,926 rows) 0:00:00.660 3 Dimensions (28 rows, 28 rows, 2,926 rows) 0:00:00.871 1 Dimension (2,926 rows) 0:00:00.490 2 Dimensions (28 rows, 2,926 rows) 0:00:00.705 1 Dimension (28 rows) 0:00:06.780 2 Dimensions (28 rows, 3,782 rows) 0:00:23.097 0 Dimensions 0:00:52.074 1 Dimension (121,964,466 rows) - Count Distinct 0:00:55.861 1 Dimension (5,547,151 rows) 0:01:08.972 2 Dimensions (5 rows, 2,926 rows) 0:00:45.060 3 Dimensions (5 rows, 2,926 rows, 3,782 rows) 0:01:39.119 0 Dimensions 0:00:06.945 2 Dimensions(5 rows, 40,040 rows) 0:00:31.980 4 Dimensions (2 rows, 487,374 rows, 7,875,489 rows, 2,038,760 rows) 0:01:33.404 0 Dimensions 0:00:12.854 1 Dimension (8,038 rows) 0:00:24.083 995,761,863 1 Dimension (3,782 rows) 0:01:23.484 1 Dimension (28 rows) 0:00:56.716 1 Dimension (5 rows) 0:00:33.750 2 Dimensions (5 rows, 3,782 rows) 0:01:11.021 0 Dimensions 0:00:32.854 2 Dimensions (3 rows, 371 rows) 0:00:54.329 520 15,036 55,676 72,745,961 121,964,466 263,223,987 378,706,328 587,679,516 1,064,423,864 1,174,737,467
    51. 51. Hive v Impala
    52. 52. TL;DW (Too long; didn’t watch)
     53. 53. DO: Capture Data. DO NOT: Aggregate Data.
     54. 54. DO: T (Transform). DO NOT: ETL (Extract, Transform, Load).
     55. 55. DO: Schema on Read. DO NOT: Schema on Load.
     56. 56. DO: Query in Place. DO NOT: Create Data Marts.
    57. 57. Thank you! Dave Mariani CEO & Founder of AtScale @dmariani Atscale.com
