Dancing with the Elephant


Published in: Technology, Business
  • Slides from a real PoC using data from an IPTV network looking at Quality of Service and Churn
  • So eBay measured… Note latency: Hadoop is batch-oriented. Note parallel efficiency: if the unit cost of acquisition is relatively low but I have to buy many more units, the total cost of acquisition is still higher. And total cost of acquisition is not TCO: development, integration, system administration, maintenance and similar costs also have to be factored in. Note also that Hadoop is an implementation of the MapReduce (MR) programming model, not a DBMS. The impact of, for example, the lack of indexes and of cost-based optimization is likely to be even more significant for more complex queries.
  • This scenario involves a telco that is experiencing an increased number of cancellations. They want to know what behaviors lead up to a cancellation and have been unable to discover those reasons until now. The challenge has been twofold. First, their data is on multiple platforms. Second, analysis has been so time-consuming that they have been unable to estimate and budget the effort. They have data on Hadoop, processed web logs on Aster, and store data housed on their Teradata EDW. All of this data needs to be combined and then analyzed in a timely fashion. This is a common situation today across many industries; you may see a solution here to your own challenges. During this presentation we will see the real code behind the solution. We will see a 3-way join of data across the three platforms, which had not been done before this demonstration. We will see the analytic results output by nPath, a SQL-MR function that comes with the Teradata Aster platform, and the visualization of those results using Tableau.
  • This is what the environment looks like. On the left is a Hadoop cluster storing data on HDFS. This is a large volume of call-center data originally stored as VRU files. After processing on Hadoop it is made available through SQL-H, a new product released with Aster Database 5 that allows SQL queries against Hadoop data. On the right is the Teradata EDW, which contains structured store transactions. It is accessed through our Teradata connector, also using SQL. In the middle is online web log data stored and pre-processed on Aster using our SQL-MR functions. All of these sources are pulled together in a single SQL query on Aster and processed through nPath to discover the customers' behavior before cancellation.
  • Let’s walk through the code required to perform this analysis. First, we create an HCatalog entry for the table. This code shows what is done on the Hadoop machine to create a table called hive_callcenter in HCatalog. It is what you might expect for any table definition: drop the table if it exists, then create the structure. You can see that the data is actually stored as a text file in Hadoop, with the location being a directory hierarchy.
  • Next, we create a view on Aster pointing to hive_callcenter on Hadoop. Since a view is just SQL code, we could put the select statement anywhere in our code; we create a permanent view because we will be using this table often. Notice that we called the view hcat_telco_callcenter. We’ll see this again later.
  • This is how we created the view into Teradata for the store data and called it td_telco_store. Again, just plain old SQL.
  • Here is where the 3-way join takes place. We create the view td_telco_multi using the views into Hadoop and Teradata along with data stored on Aster. Because all three sources expose the same columns, the view combines them with UNION ALL. Remember td_telco_store from the last page and hcat_telco_callcenter from the one before that; telco_online is the data stored on Aster. This is an ANSI-standard view created on Aster. This had, quite literally, never been done before it was done for this presentation.
  • Here is another look at the views, from Aqua Data Studio. Look closely and you can see the views td_telco_store and hcat_telco_callcenter. telco_online is a regular table on Aster and is not seen here. Each of these tables/views has around a million rows. When we run a count on the view td_telco_multi we see a little over 3 million rows returned for the time period. As Aqua Data Studio and Tableau demonstrate, these data sources are available to almost any BI tool, system tool, or application that understands ODBC/JDBC. So now that we have all of the views in place to bring this data together in real time, how do we supply it to nPath?
  • It’s actually very simple. Here is the 3-way join supplied to nPath. Notice that this is just another SQL query. There is some very sophisticated MapReduce code running under the covers of nPath, but to the business user it is exposed as an external table function with replaceable parameters. This is what makes the very powerful SQL-MapReduce functions of Teradata Aster available to the business user without programming experience beyond SQL. It’s just replaceable parameters on the function. Being this straightforward is also what makes fast analytic iterations possible. The most important parts of this nPath function are the patterns searched for and the actions taken. They are very simple in this case: look for all events in a session that ends in a cancellation of service. If it is just an event, label it as such; if it is a cancellation of service, label it as Cancel Service. Getting the parameters right is the most challenging thing about using the SQL-MR functions. However, since no programming or projects are required, a business user can afford to try lots of different parameters to experiment with and explore the data.
  • This is the visualization of the output data we looked at on the previous slide. It represents all customers who cancelled their service and the pathways they took in the 4 steps preceding cancellation. Starting at the left, we see all of the channels through which a customer could have entered. There are 14 of them. They represent the call-center data on Hadoop, the online web logs on Aster, and the store transactional data on the Teradata EDW. This is what the first pass of analysis often looks like in the real world. It’s very busy; it’s the first attempt at exploration, with little or no filtering of data. As you may recall, the nPath statement we looked at was relatively simple. We can see from the thickness of the colored lines on the right side that there is a lot of activity around the call center and the store, but there is too much noise to determine what common behaviors exist that might be actionable. Following this, there are numerous iterations of altering the nPath parameters to get to the final, quiet determination of common behavior.
  • This is the final nPath function that will show us a real Golden Pathway for customer cancellations. It is very similar to the first pass nPath. It’s only a few lines of SQL and some additional parameters. It uses the same 3-way join of data and will execute the next steps identically to the first nPath. Notice in the PATTERN parameters that there is more specificity, and that the actions are more granular. This is how noise was removed from the data. Again, this is the real code that creates the visualizations. Let’s take a look at what this data looks like.
  • Here it is: the Golden Path toward cancellation. It’s a lot cleaner and actually shows us what customers were doing before they cancelled, in a way that we can do something about. Starting from the left, we see that customers came in through the online channel and reviewed their contract, followed by at least one, and usually two, calls to the call center either disputing their bill or registering a service complaint. The thickness of the lines shows us that there were more disputes than service complaints. These calls were followed by a visit to the store with a dispute or complaint. That is where the cancellations occurred. This is actionable. We can implement this model in our production systems by counting the online visits and calls to the call center. For the entire population of customers, if the number of online reviews is > 0 and the number of calls into the call center is > 1, then we have a customer who has a higher probability of cancelling their service and can be flagged for intervention on their next contact. This entire analysis took place over a few days. Let’s think about this for a moment. Imagine trying to come to this conclusion using traditional SQL, without the SQL-MR function nPath, and without the ability to join this data. The first challenge is pulling the data together, with the biggest challenge coming from the data on Hadoop. This currently requires a skilled engineer writing MapReduce code in a lower-level language just to pull the data out. The manipulation of the data once gathered together requires around 350-400 lines of complex, recursive SQL code. Neither the pulling of the Hadoop data nor the SQL development is trivial. Both require skilled programmers and, most likely, several months of work. In most shops, this level of resource allocation and time requires that a project be scoped with detailed requirements, resourced, approved and budgeted.
As challenging, expensive, and time-consuming as that project might be, the real problem is that this analysis requires many iterations; in fact, an unknown number of iterations. Each of those iterations may require a separate project. You know, Phase 1, 2, 3, etc. This analysis actually took on the order of nine iterations through nPath over several days. So what really happens when an organization is confronted by analysis needs like this without nPath? I can tell you that it is usually nothing. You can’t pre-determine the number of iterations, so you can’t scope it. If you can’t do that, you aren’t going to get approval to budget and resource a project that has no end date in sight. The reality is that most organizations never get to an answer like this. However, with nPath, a business analyst, and a few days’ work, without ever having to approve a project, one can not only get to the answer but also formulate an action plan. That is the real value proposition here: difficult analysis done quickly by business analysts without the need to budget expensive and in-demand resources.
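The flagging rule described in the Golden Path note (online reviews > 0 and call-center calls > 1) can be sketched in a few lines of Python. This is a minimal illustration, not the production model from the PoC: the `Event` type, the channel labels (`"online"`, `"callcenter"`) and the action label `"review_contract"` are assumptions made for the example, since the slides do not list the actual values.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Event(NamedTuple):
    customer_id: int
    channel: str  # hypothetical labels: "online", "callcenter", "store"
    action: str   # hypothetical labels, e.g. "review_contract"


def flag_at_risk(events: Iterable[Event]) -> Set[int]:
    """Return customers with > 0 online contract reviews and > 1 call-center calls."""
    online_reviews: Counter = Counter()
    calls: Counter = Counter()
    for e in events:
        if e.channel == "online" and e.action == "review_contract":
            online_reviews[e.customer_id] += 1
        elif e.channel == "callcenter":
            calls[e.customer_id] += 1
    # Counter returns 0 for customers it has never seen, so no KeyError here.
    return {c for c in online_reviews if online_reviews[c] > 0 and calls[c] > 1}


sample_events = [
    Event(1, "online", "review_contract"),
    Event(1, "callcenter", "dispute_bill"),
    Event(1, "callcenter", "service_complaint"),
    Event(2, "online", "review_contract"),
    Event(2, "callcenter", "dispute_bill"),   # only one call: not flagged
    Event(3, "callcenter", "dispute_bill"),
    Event(3, "callcenter", "dispute_bill"),   # no online review: not flagged
]
at_risk = flag_at_risk(sample_events)
```

With the sample data only customer 1 satisfies both conditions. In practice the counts would presumably be computed over the combined three-channel view rather than an in-memory list.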
  • Slides from a real PoC using data from an IPTV network looking at Quality of Service and Churn
  • First looked at analysing the complaints data, which was text files stored in Hadoop; got nowhere with this. The text analytics showed that the comments fields held standard phrases such as “No fault found” or “customer issue”, or were just blank. Good example of fail fast: if it isn’t going to work, realise this and stop doing it as quickly as possible.
  • Looked at patterns in data usage prior to a customer closing their account. Here each line represents a customer; it appears that just prior to account closure there was a huge surge in usage. This turned out to be an error in the data (again!)
  • Decided to look at the number of home router reboots as a measure of quality of service. Here the pattern of 5 events preceding a reboot can be seen, along with the code used to generate the Sankey chart (now a native Aster format which is viewed in a web browser).
  • As data issues had been found previously, went back and used SQL and Tableau to check the data. Found an issue on September 30th, but as the data only needs to be “good enough” to run the analysis, we can safely ignore this day and just use the 1st to the 29th for our investigation.
  • Final pattern with some of the noise cleaned up… The high transmitted blocks event doesn’t help much, because it just shows that if you use the service a lot then you are more likely to reboot… But the other 3 events show synchronisation speed errors, which can be detected on the network and lead to issues with the IPTV signal at the customer end.
  • Using Aster’s built-in graph visualisation we can now see the way the synchro errors affect users across the entire network in a single picture. Note the thick red line in the highlighted area and another one down and to the right of it.
  • Talking to the network engineers, we found out that there are two different types of hub in use. The older ones are on the left and the newer ones on the right; you can see from the colours that the newer ones are reporting far more errors than the older ones.
  • Final chart. Blue = new hubs, orange = old hubs. The 4th chart shows that the customers connected to the new hubs are complaining more. The 3rd chart shows that complaints by customers connected to the new hubs take longer to resolve. These two are proof of the quality of service issues. The 2nd chart shows bandwidth (higher is better), so the new hubs are actually getting better bandwidth. The 1st chart shows synchro speed (higher is better), so the new hubs have worse synchro speed. The top two charts look like mirror images: as the bandwidth increases, the synchro speed decreases, causing the QoS issue. This turned out to be a firmware issue and not faulty hubs at all.
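The reboot analysis in the IPTV notes (up to 5 events preceding a reboot, counted per path) can be approximated outside Aster with a short Python sketch. This is a simplified stand-in for the nPath call on the slides, not a reimplementation of Aster's matching semantics: events are taken in chronological order, each reboot closes a path, and the event names in the usage example are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def paths_before_reboot(
    events: List[Tuple[str, int, str]], max_lead: int = 5
) -> Counter:
    """Count paths of up to `max_lead` events that end in a REBOOT.

    `events` holds (srv_id, timestamp, evt_name) tuples in any order.
    """
    by_line: Dict[str, List[Tuple[int, str]]] = {}
    for srv_id, ts, name in events:
        by_line.setdefault(srv_id, []).append((ts, name))

    counts: Counter = Counter()
    for rows in by_line.values():
        rows.sort()  # chronological order within one service line
        window: List[str] = []  # events seen since the previous reboot
        for _, name in rows:
            if name == "REBOOT":
                path = window[-max_lead:] + ["REBOOT"]
                counts[" > ".join(path)] += 1
                window = []  # non-overlapping: consumed events are not reused
            else:
                window.append(name)
    return counts


sample = [
    ("a", 1, "SYNC_ERR"), ("a", 2, "HIGH_TX"), ("a", 3, "REBOOT"),
    ("a", 4, "REBOOT"),  # a reboot straight after a reboot: one-event path
    ("b", 1, "SYNC_ERR"), ("b", 2, "HIGH_TX"), ("b", 3, "REBOOT"),
]
path_counts = paths_before_reboot(sample)
```

Sorting `path_counts.most_common()` gives the same kind of ranked path list that feeds the Sankey chart, and paths consisting of a lone `REBOOT` are the "reboot following another reboot" noise that was later removed.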

    1. 1. Dancing with the Elephant (4/8/2013, Teradata Confidential)
    2. 2. UDA IN PRACTICE • Teradata and Big Data • Customer Churn Example > Examples of Code > How the UDA works in Practice • IPTV Example > Data Science Workflow > Real-life Example
    4. 4. Modern information management: year zero In 1970, computer scientist and former war-time Royal Air Force pilot Ted Codd published a seminal academic paper that would change Information Management forever…
    5. 5. Lots of transactions, or lots of data to analyse? …Codd had envisaged “large, shared data banks”, queried any-which-way; but the first RDBMS implementations had focused on providing support for on-line transaction processing…
    6. 6. Modern information management: year nine …so in 1979, four academics and software engineers quit their day jobs, maxed-out their credit cards – and built the world’s first MPP Relational Database Computer in a garage in California.
    7. 7. Teradata’s “shared nothing” hardware appliance model has since been widely emulated*… Products on the 1980-2010 timeline: 1st Teradata implementation goes live at Wells Fargo; IBM DB2 Parallel Edition; Kognitio (WhiteCross); Netezza; DATAllegro; Greenplum; Aster Data; Vertica; NeoView; Oracle Exadata. * But some are more Massively Parallel Processor than others!
    8. 8. “Teradata was Big Data before there was Big Data” • Total data volume under management: ~40 Exabytes • Largest single implementation: ~40 Petabytes • # customers in the Teradata PB club: 25 • Largest hybrid system: 1,500 SSDs; 12,000 HDDs
    9. 9. Key takeaway: “Big Data” are typically non-relational or “multi-structured” I didn’t say Bill was ugly. I didn’t say Bill was ugly. I didn’t say Bill was ugly. I didn’t say Bill was ugly. I didn’t say Bill was ugly. I didn’t say Bill was ugly.
    10. 10. The Unified Data Architecture • Engineers, Data Scientists, Quants, Business Analysts • Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualization, etc. • Discovery Platform • Integrated Data Warehouse • Capture, Store, Refine • Audio/Video, Images, Text, Web & Social, Machine Logs, CRM, SCM, ERP
    11. 11. Aster SQL-H Integration with Hadoop Catalog: A Business User’s Bridge to Analyzing Data in Hadoop • Industry’s First Database Integration with Hadoop’s HCatalog • Abstraction layer to easily and efficiently read structured & multi-structured data stored in HDFS • Uses Hadoop Catalog (HCatalog) to perform data abstraction functions (e.g. automatically understands tables, data partitions) • HDFS data presented to users as Aster tables • Fully accessible within the Aster SQL and SQL-MapReduce processing engines, plus ODBC/JDBC & BI tools • Diagram labels: SQL-H, Hadoop, MR, Hive, HCatalog, Pig, HDFS. Confidential and proprietary. Copyright © 2012 Teradata Corporation.
    13. 13. SQL-MapReduce 3-Way Join Example. Scenario: A telco has noticed an increase in the number of their customers cancelling their service. They want to know what customer behavior is leading to termination. They have data in Hadoop, processed web logs on Aster, and store data in a Teradata EDW. They need to combine it to see all channels and get answers. What will we see? • Real working code examples • A 3-way join between Aster, Teradata, and Hadoop • Execution of nPath and Pathmap SQL-MapReduce functions sourced by the 3-way join • Visualization of the results using Tableau
    14. 14. Golden Path Analysis of Cancellation Paths: Identifying Top Multi-channel Cancellation Paths • Data on TERADATA • HCatalog metadata & Data on HDFS • Data on ASTER
    15. 15. Create Table structure in HCatalog
        -- Recreate the call-center table from scratch
        drop table if exists hive_callcenter;
        create table hive_callcenter (
            customer_id int,
            sessionid   int,
            channel     string,
            action      string,
            datestamp   string
        )
        -- Tab-delimited text files stored under the Hive warehouse directory
        row format delimited fields terminated by '\t'
        stored as TEXTFILE
        location '/apps/hive/warehouse/hive_callcenter';
    16. 16. Create the view into Hadoop using SQL-MR function load_from_hcatalog
        DROP VIEW IF EXISTS hcat_telco_callcenter;
        CREATE VIEW hcat_telco_callcenter AS
        SELECT "customer_id",
               "sessionid",
               "channel" :: character varying AS "channel",
               "action" :: character varying AS "action",
               "datestamp" :: timestamp without time zone AS "datestamp"
        FROM "nc_system"."load_from_hcatalog" (
            ON "public"."mr_driver"
            server ('presales27.asterdata.com')
            port ('9083')
            dbname ('default')
            tablename ('hive_callcenter')
            username ('hive')
        );
    17. 17. Create the view into Teradata using SQL-MR function load_from_teradata
        DROP VIEW IF EXISTS td_telco_store;
        CREATE VIEW td_telco_store AS
        SELECT *
        FROM load_from_teradata (
            ON mr_driver
            tdpid ('dbc')
            username ('dbc')
            PASSWORD ('dbc')
            QUERY ('SELECT * from icw.td_telco_store')
            NUM_INSTANCES ('2')
        );
    18. 18. Create 3-Way View/Join as input to nPath
        DROP VIEW IF EXISTS td_telco_multi;
        CREATE VIEW td_telco_multi AS
        SELECT "u"."customer_id" AS "customer_id", "u"."sessionid" AS "sessionid", "u"."channel" AS "channel", "u"."action" AS "action", "u"."datestamp" AS "datestamp"
        FROM (
            ( SELECT "t"."customer_id", "t"."sessionid", "t"."channel", "t"."action", "t"."datestamp"
              FROM "public"."td_telco_store" AS "t" )
            UNION ALL
            ( SELECT "t"."customer_id", "t"."sessionid", "t"."channel", "t"."action", "t"."datestamp"
              FROM "public"."hcat_telco_callcenter" AS "t" )
            UNION ALL
            ( SELECT "t"."customer_id", "t"."sessionid", "t"."channel", "t"."action", "t"."datestamp"
              FROM "public"."telco_online" AS "t" )
        ) AS "u";
    19. 19. Views of External Tables from Aster
    20. 20. First Pass Aster nPath for Churn Pathway 3 way Join
    21. 21. First Pass nPath Visual
    22. 22. Final Pass Aster nPath for Churn Pathway 3 way Join
    23. 23. Last Pass nPath Visual
    24. 24. IPTV
    25. 25. Starting point: Complaints Data
    26. 26. Churners – and data quality
    27. 27. What events lead up to a reboot? Note number of paths with a reboot, following another reboot!
        -- Step 1: collect up to 5 events leading to each reboot and count each path
        CREATE dimension table wrk.npath_reboot_5events AS
        SELECT path, COUNT(*) AS path_count
        FROM nPath (
            ON wrk.w_event_f
            PARTITION BY srv_id
            ORDER BY evt_ts desc
            MODE (NONOVERLAPPING)
            PATTERN ('X{0,5}.reboot')
            SYMBOLS (true as X, evt_name = 'REBOOT' AS reboot)
            RESULT (FIRST(srv_id OF X) AS srv_id,
                    ACCUMULATE (evt_name OF ANY (X,reboot)) AS path)
        )
        GROUP BY 1;
        -- Step 2: render 30 of the counted paths as a Sankey chart
        SELECT * FROM GraphGen (
            ON (SELECT * from wrk.npath_reboot_5events ORDER BY path_count LIMIT 30)
            PARTITION BY 1
            ORDER BY path_count desc
            item_format('npath')
            item1_col('path')
            score_col('path_count')
            output_format('sankey')
            justify('right'));
    28. 28. View events data in Tableau. Looks like an issue with the data on the 30th September and beyond: the reboot data for October seems to have been aggregated and added to the 30th September.
    29. 29. Address data quality • Remove paths with all reboots and exclude data from 30th September • Would appear that events with suffix 1 and 2 can be added together
    30. 30. Visualise as a Graph using Aster GraphGen. Size of Node = number of customers; Width of Edge = number of errors
        SELECT * FROM graphgen (
            ON (SELECT DISTINCT dmt_act_dslam, nra_id, nbr_of_srvid, errorspersrv, nbr_of_dslam
                FROM wrk.srvid_dslam_err)
            PARTITION BY 1
            ORDER BY errorspersrv
            item_format('cfilter')
            item1_col('dmt_act_dslam')
            item2_col('nra_id')
            score_col('errorspersrv')
            cnt1_col('nbr_of_srvid')
            cnt2_col('nbr_of_dslam')
            output_format('sigma')
            directed('false')
            width_max(10) width_min(1)
            nodesize_max(3) nodesize_min(1));
    31. 31. Synch Issues by Hub Type
    32. 32. Error and Complaint rates by equipment type
    33. 33. Thank You, Any questions?
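The 3-way view from slide 18 is, structurally, a UNION ALL over three sources that share the same five columns. A minimal way to see that shape without an Aster, Teradata or Hadoop installation is to imitate it with Python's built-in sqlite3 module. The table and view names below follow the slides, but the three sources are ordinary local tables here and every row is made-up illustration data; nothing in this sketch uses load_from_hcatalog or load_from_teradata.

```python
import sqlite3

# In-memory stand-ins for the three sources; table names follow the slides,
# the rows are invented purely for illustration.
conn = sqlite3.connect(":memory:")
columns = "customer_id INTEGER, sessionid INTEGER, channel TEXT, action TEXT, datestamp TEXT"
sample_rows = {
    "td_telco_store": [(1, 30, "store", "cancel_service", "2013-01-09")],
    "hcat_telco_callcenter": [
        (1, 20, "callcenter", "dispute_bill", "2013-01-03"),
        (1, 21, "callcenter", "service_complaint", "2013-01-05"),
    ],
    "telco_online": [(1, 10, "online", "review_contract", "2013-01-01")],
}
for table, rows in sample_rows.items():
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)", rows)

# Same shape as the slide-18 view: one SELECT per source, glued with UNION ALL.
select_cols = "customer_id, sessionid, channel, action, datestamp"
conn.execute(f"""
    CREATE VIEW td_telco_multi AS
    SELECT {select_cols} FROM td_telco_store
    UNION ALL
    SELECT {select_cols} FROM hcat_telco_callcenter
    UNION ALL
    SELECT {select_cols} FROM telco_online
""")

# Row count per channel, and one customer's cross-channel timeline.
per_channel = conn.execute(
    "SELECT channel, COUNT(*) FROM td_telco_multi GROUP BY channel ORDER BY channel"
).fetchall()
timeline = [
    action
    for (action,) in conn.execute(
        "SELECT action FROM td_telco_multi WHERE customer_id = 1 ORDER BY datestamp"
    )
]
```

Once the sources are behind a single view, any downstream query (or, on Aster, a function such as nPath) sees one event stream per customer, which is what makes the cross-channel path analysis possible.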