Real Time Analytics using Cloudera Impala in Manufacturing use case
CSCI E-185 Big Data Analytics -- Final project, Fall 2013

Presentation Transcript

  • Final Project: Real Time Analytics using Cloudera Impala in Manufacturing use case
    Rapheephan Thongkham-uan (Nancy)
    CSCI E-185 Big Data Analytics
    @Rapheephan Thongkham-Uan, Friday, May 10, 13
  • Making Big Data Make Money
    In manufacturing:
    - We want to improve supply chain management by tracking defective parts, finding bottlenecks, etc.
    - We analyze large amounts of data with traditional tools, which takes too much time.
    - People in the factory are familiar with SQL queries.
    The faster we analyze the big data, the more we gain:
    - faster detection of defects and bottlenecks
    - near-real-time problem solving and decision-making
    - less time and money spent on defects
    That's why we need Cloudera Impala.
  • Requirements
    - Cloudera Manager 4.5.2 installation guide: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Free/latest/ClouderaManager-Free-Edition-Installation-Guide/Cloudera-Manager-Free-Edition-Installation-Guide.html
    - My VM: Ubuntu 12.04 (Precise) 64-bit, CDH 4.2, Cloudera Manager 4.5.2
    - I installed Impala via Cloudera Manager.
  • After finishing the Cloudera Manager installation
  • We will use the Hue Web UI to query Impala
    From the Services menu bar, click HUE1 and choose Hue Web UI.
  • Create table in Hive
    Create the Hive table as user impala, then load the data from the local file into the table:
    $ sudo -E -u impala hive -e "CREATE TABLE khsample (id INT, sdate STRING, seq INT, product STRING, ope STRING, resource_grp STRING, resource STRING, inflow FLOAT, proclot FLOAT, wip FLOAT, ope_rate FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
    $ sudo -E -u impala hive -e "LOAD DATA LOCAL INPATH 'KH_RESULT.csv' INTO TABLE khsample;"
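    The table layout and comma-delimited load above can be tried locally without a cluster. The following is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Hive (an assumption made purely for illustration; it is not how the slides load data). The function name `load_khsample` and the sample rows are hypothetical. Quotes are deliberately left inside the field values, because Hive's plain delimited text format does not strip them, which would explain why the later query compares sdate against a literal that contains double quotes.

```python
import csv
import io
import sqlite3

# Column names follow the khsample definition in the slide;
# the SQLite types are the nearest equivalents of INT / STRING / FLOAT.
COLUMNS = [
    ("id", "INTEGER"), ("sdate", "TEXT"), ("seq", "INTEGER"),
    ("product", "TEXT"), ("ope", "TEXT"), ("resource_grp", "TEXT"),
    ("resource", "TEXT"), ("inflow", "REAL"), ("proclot", "REAL"),
    ("wip", "REAL"), ("ope_rate", "REAL"),
]


def load_khsample(conn, csv_file):
    """Create khsample and load comma-delimited rows; return the row count."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in COLUMNS)
    conn.execute(f"CREATE TABLE khsample ({cols})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    # QUOTE_NONE keeps double quotes as part of the value instead of
    # treating them as field delimiters.
    reader = csv.reader(csv_file, quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    conn.executemany(f"INSERT INTO khsample VALUES ({placeholders})", rows)
    return len(rows)


# Made-up rows for illustration only; they are not taken from KH_RESULT.csv.
SAMPLE_CSV = '''118,"2012/12/14",1,P1,OP10,GRP_A,M01,5.0,4.0,1.0,0.8
118,"2012/12/16",2,P1,OP10,GRP_A,M01,6.0,4.0,3.0,0.7
'''

conn = sqlite3.connect(":memory:")
n = load_khsample(conn, io.StringIO(SAMPLE_CSV))
print(n, conn.execute("SELECT sdate, wip FROM khsample ORDER BY seq").fetchall())
```

    To try it on a real file, pass `open("KH_RESULT.csv", newline="")` instead of the in-memory sample.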
  • Sample table in Hue Web UI
    The table we just created in the Hive shell can be viewed in the Hue Web UI.
    *The input data includes Japanese characters, which cannot be displayed.
  • Refresh Impala
    Before querying Impala in the Hue Web UI, we have to refresh Impala first. In impala-shell, enter the following command:
    $ impala-shell
    [impala-server:21000] > refresh;
  • Query in Impala
    In the Hue Web UI, click the Impala icon and the query editor page will be shown. Enter the query and execute it.
  • Bottlenecks query
    - To find the groups of machines that are bottlenecks, we can compare WIP by day. A group of machines whose WIP value is higher than the day before can be predicted to be a bottleneck.
    - The simulation dates ran from 12/13 to 12/22. I take the sum of the WIP values on the sampling dates (12/14, 12/16, 12/18, 12/20, 12/22).
    - This requires 5 sub-queries in the FROM clause.
  • Bottlenecks query (2)
    SELECT A.resource_grp,
           A.awip AS wip22,  -- 12/22 wip
           B.bwip AS wip20,  -- 12/20 wip
           C.cwip AS wip18,  -- 12/18 wip
           D.dwip AS wip16,  -- 12/16 wip
           E.ewip AS wip14   -- 12/14 wip (the slide repeats D.dwip here)
    FROM
      -- GROUP BY resource_grp is needed to select resource_grp next to sum(wip);
      -- it does not appear in the slide text.
      (SELECT resource_grp, sum(wip) AS awip FROM khsample
       WHERE id = 118 AND sdate = '"2012/12/22"' GROUP BY resource_grp) A JOIN
      (SELECT resource_grp, sum(wip) AS bwip FROM khsample
       WHERE id = 118 AND sdate = '"2012/12/20"' GROUP BY resource_grp) B JOIN
      (SELECT resource_grp, sum(wip) AS cwip FROM khsample
       WHERE id = 118 AND sdate = '"2012/12/18"' GROUP BY resource_grp) C JOIN
      (SELECT resource_grp, sum(wip) AS dwip FROM khsample
       WHERE id = 118 AND sdate = '"2012/12/16"' GROUP BY resource_grp) D JOIN
      (SELECT resource_grp, sum(wip) AS ewip FROM khsample
       WHERE id = 118 AND sdate = '"2012/12/14"' GROUP BY resource_grp) E
    WHERE A.resource_grp = B.resource_grp
      AND A.resource_grp = C.resource_grp
      AND A.resource_grp = D.resource_grp
      AND A.resource_grp = E.resource_grp
      AND A.awip >= B.bwip AND B.bwip >= C.cwip
      AND C.cwip >= D.dwip AND D.dwip >= E.ewip
    ORDER BY A.awip DESC
    LIMIT 20;
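    The selection rule in the query (summed WIP per machine group never decreases across the five sampling dates) can be checked on a small data set. The sketch below is a different formulation from the slide's five-sub-query join: it runs one GROUP BY pass and applies the comparison in Python, with sqlite3 as a local stand-in for Impala. The function name `find_bottlenecks`, the reduced four-column table, and all data values are assumptions made for illustration.

```python
import sqlite3

# Sampling dates named in the slides, oldest first.
SAMPLE_DATES = ["2012/12/14", "2012/12/16", "2012/12/18",
                "2012/12/20", "2012/12/22"]


def find_bottlenecks(conn, run_id=118, limit=20):
    """Return (resource_grp, [summed wip per sampling date]) for groups
    whose summed WIP never decreases, ordered by the latest WIP, descending."""
    totals = {}
    rows = conn.execute(
        "SELECT resource_grp, sdate, SUM(wip) FROM khsample "
        "WHERE id = ? GROUP BY resource_grp, sdate", (run_id,))
    for grp, sdate, wip in rows:
        # Stored dates carry literal double quotes; drop them for lookup.
        totals.setdefault(grp, {})[sdate.strip('"')] = wip
    result = []
    for grp, by_date in totals.items():
        # Like the inner joins, require a value on every sampling date.
        if not all(d in by_date for d in SAMPLE_DATES):
            continue
        series = [by_date[d] for d in SAMPLE_DATES]
        if all(earlier <= later for earlier, later in zip(series, series[1:])):
            result.append((grp, series))
    result.sort(key=lambda item: item[1][-1], reverse=True)
    return result[:limit]


# Made-up WIP totals per group and sampling date (not from the real data).
made_up = {
    "GRP_A": [1.0, 2.0, 3.0, 4.0, 5.0],   # steadily rising
    "GRP_B": [5.0, 4.0, 6.0, 7.0, 8.0],   # dips once, so it is excluded
    "GRP_C": [2.0, 2.0, 2.0, 2.0, 9.0],   # flat, then a jump
}
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE khsample (id INTEGER, sdate TEXT, resource_grp TEXT, wip REAL)")
for grp, series in made_up.items():
    for d, wip in zip(SAMPLE_DATES, series):
        for _ in range(2):  # two rows per day so SUM(wip) has work to do
            conn.execute("INSERT INTO khsample VALUES (118, ?, ?, ?)",
                         (f'"{d}"', grp, wip / 2))

bottlenecks = find_bottlenecks(conn)
for grp, series in bottlenecks:
    print(grp, series)
```

    Reading each date once and comparing in application code avoids scanning the table five times, at the cost of moving the comparison out of SQL; whether that is faster on a real Impala cluster is not something the slides measure.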
  • Comparing the result of Impala with Oracle SQL
  • Results
    - Joining 5 sub-queries in Oracle SQL took 50 s.
    - Joining 5 sub-queries in Impala took 6.67 s.
    - Impala ran the query about 7.5x faster, with the same results.
    - In real use, we could configure Impala to work with HBase and also move the Hive Metastore to Oracle DB.
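    The ratio of the two timings reported above can be checked directly. This short sketch uses only the two figures from the slide; the variable names are illustrative.

```python
# Timings reported in the slide, in seconds.
oracle_seconds = 50.0
impala_seconds = 6.67

# Speedup is the ratio of the slower time to the faster time.
speedup = oracle_seconds / impala_seconds
# Share of the original run time that was saved.
time_saved_pct = (1 - impala_seconds / oracle_seconds) * 100

print(f"speedup: {speedup:.2f}x, time saved: {time_saved_pct:.1f}%")
```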