AWS Customer Presentation - JovianDATA
Satya and Anupam discuss how JovianDATA uses AWS and share some lessons learned. AWS Startup Tour - SV - 2010

  • JovianDATA’s mission is to provide a technology platform to help users optimize the entire digital marketing funnel at the lowest cost.
  • At its core, JovianDATA solves the “analytics on large data” problem. Our customers have huge amounts of digital data, and they face three central challenges when trying to analyze it: huge upfront and ongoing CapEx; heavy maintenance and over-provisioning; and a lack of application richness. All of them want to move to the cloud, but they are not sure about the CapEx benefits or about the readiness of the cloud to support the complex application stacks typical of such installations, which leads to a second-order problem: application provisioning challenges.
  • We believe a fully integrated stack built on the cloud, using sophisticated distributed technology with commodity components, is key to a high-performance, low-cost solution for tackling large data. AWS’s rich cloud functionality and JovianDATA’s distributed technology make analytics on large data in the cloud possible. We can take impression data and site data and marry them with third-party data and sales data, thereby providing unified, correlated, and sophisticated analysis to the various players in the ad ecosystem.
  • Let me present a brief overview of the system before we go into the details later in the presentation. The JovianDATA SaaS system takes data, provides rich analytics, and handles everything in between. A distributed ETL layer built on top of Java and MySQL takes raw data and applies single-event filters and transformations. The data is then loaded into our massively parallel warehouse, where more complex rules that look at all the data (as opposed to single rows) are applied. We also collect statistics here, which are used for data cleansing as well as to size the model. Using the statistics, we build a proprietary array-based structure which provides the illusion of a fully materialized cube; the structure is distributed across the cluster. An MDX engine then takes a multi-dimensional query, breaks it into tuples, and calculates them in parallel across the cluster. One of the key aspects of the JovianDATA system is that the load involves complex workflows, and we have a framework to manage these workflows very efficiently. The results are apparent: in one test run by a customer, 10 users ran 450 reports on a 2.5 TB warehouse for 6 hours. 90% of the reports returned in less than 10 seconds, 40% in less than 100 ms, and the longest report took 113 seconds. Now Anupam will go into the details of the system.
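The MDX engine itself is proprietary, but the "break a multi-dimensional query into tuples and compute them in parallel" step can be illustrated with a minimal sketch; the toy cube, axis members, and function names below are hypothetical, not JovianDATA's actual engine:

```python
# Minimal sketch of decomposing a multi-dimensional query into tuples and
# evaluating them in parallel. All names here are hypothetical illustrations.
from itertools import product
from multiprocessing import Pool

# Toy pre-aggregated "cube": (geo, site_section) -> impressions
CUBE = {
    ("US", "home"): 120_000, ("US", "sports"): 80_000,
    ("UK", "home"): 40_000,  ("UK", "sports"): 15_000,
}

def expand_query(members_per_axis):
    """Cross the requested members on each axis into concrete tuples,
    the unit of parallel work."""
    return list(product(*members_per_axis))

def evaluate(tup):
    """Resolve one tuple against the cube; a real engine would route this
    to the cluster node owning the tuple's partition."""
    return tup, CUBE.get(tup, 0)

if __name__ == "__main__":
    tuples = expand_query([["US", "UK"], ["home", "sports"]])
    with Pool(4) as pool:                  # stands in for the node cluster
        for tup, value in pool.map(evaluate, tuples):
            print(tup, value)
```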
  • CLICK 1: Let's contrast data processing on the cloud with the conventional enterprise data center. A user comes in to run analytics. CLICK 2: One of the biggest misconceptions about the cloud is that all you need to do is let the analyst connect to a data center in the cloud: create an AMI with your favorite software (MySQL, Hadoop, Oracle, etc.) and bring up hundreds of instances. This is like saying EC2 is just about ec2-run-instances. At JovianDATA, we have created an analytics engine that revisits three key expensive operations in classical stacks; each operation has been rewritten to exploit the cloud. CLICK 3: Instead of keeping large clusters to run expensive grouping, we believe in bringing up an extraordinary number of nodes for a short time. CLICK 4: This pre-calculation allows us to reduce disk I/O requirements at run time. CLICK 5: Most clouds do not excel at inter-processor communication, and a big reason for network I/O is the notion of joining tables in databases. CLICK 6: We eliminate joins of large tables for the cube structure by using a patent-pending partitioning technique for multi-dimensional data. CLICK 7: To allow many users to load the same reports again and again, many of our digital media customers use materialized views, which require DBA intervention. CLICK 8: Instead, we materialize views based on the customer's usage, without a DBA. This makes often-used data available to a multitude of users.
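The actual partitioning technique is patent-pending and not described in the talk; as a generic illustration of how multi-dimensional keys can be routed to nodes so that cube lookups avoid cross-node joins, consider this sketch (the node list and key format are hypothetical):

```python
# Generic hash-partitioning sketch: route each multi-dimensional key to a
# node so lookups never need a cross-node join. This illustrates the idea
# only; it is not JovianDATA's patent-pending scheme.
import hashlib

NODES = ["node1", "node2", "node3", "node4"]

def node_for(dimension_key):
    """Map a tuple of dimension members, e.g. ('US', 'home', '2010-06-01'),
    to the node that owns it."""
    digest = hashlib.md5("|".join(dimension_key).encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Fact rows and the cube cells derived from them hash to the same node,
# so answering a query for a cell is a single-node lookup, not a
# distributed join.
print(node_for(("US", "home", "2010-06-01")))
```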
  • JovianDATA works with customers who generate tens of terabytes of data. In two years, we have identified three major problems that compel enterprise customers to move their analytics to the cloud. Capital expenditure: getting a BI project running on 10 terabytes requires nearly a year of planning with an upfront six-figure investment just to load and process the data. In this presentation, we will show how you don't have to sign up for a six-figure sum to try out 10 TB analytics. Over-provisioning: anybody who has worked with a real-world analytics stack knows that half the cluster lies underutilized nearly all the time. To add insult to injury, there are periods when the entire cluster lies unused some 20-30% of the time, like weekends and nights. All these nodes were allocated because of some peak usage on Monday morning. We will show you how we avoid provisioning for peak; instead, we provision based on usage. Application isolation: even after assigning hundreds of nodes to a task, applications keep running into each other in a real-life deployment. The cloud provides a great opportunity to provision applications in their own sandboxes.
  • CLICK 1: Let's look at a classical analytics stack for 15 TB of data. A monolithic stack deployed in the cloud seldom exploits the cloud; in most cases, all it can take care of is expansion. CLICK 2: If we keep 15 TB up on 100 nodes around the clock, it can cost up to $28,800 a month. Is it cheaper than maintaining your own data center? Absolutely. Does it really use the cloud's pay-as-you-go capability? Absolutely not. Let's look at an architecture that was built for the cloud rather than a retrofit of an old architecture.
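As a sanity check on the dollar figures quoted in these slides, here is a back-of-the-envelope calculation; the $0.40/node-hour rate is an assumption (a typical large-instance on-demand price circa 2010), not a number stated by the speakers:

```python
# Back-of-the-envelope check of the costs quoted on these slides, assuming
# a nominal on-demand rate of $0.40 per node-hour (an assumption; the exact
# rate used by the speakers is not stated).
RATE = 0.40            # $ per node-hour (assumed)
HOURS_PER_MONTH = 720  # 30 days x 24 hours

always_on_100 = 100 * HOURS_PER_MONTH * RATE  # monolithic 15 TB cluster
always_on_50  = 50 * HOURS_PER_MONTH * RATE   # 50 'live' nodes per month
on_demand_50  = 50 * 16 * RATE                # 50 nodes, ~16 hours of fortnightly analysis

print(always_on_100)  # 28800.0 -> matches the $28,800/month slide
print(always_on_50)   # 14400.0 -> matches the $14,400/month slide
print(on_demand_50)   # 320.0   -> matches the $320 slide
```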
  • In JovianDATA, nodes are allocated based on the needs of a particular stage of data processing. CLICK 1: Here we show an example flow where data becomes available sometime at night on the DoubleClick FTP server. CLICK 2: As the moon rises (so to speak), the data cleansing stage starts off and completes the model building during the hours of the night. CLICK 3: The generated model is then hibernated to S3 or EBS, and the data cleansing and model building clusters are terminated. CLICK 4: As the sun rises (so to speak), the query cluster is allocated and the model is restored. Our experience here is that even though there are no rampant node failures on EC2, we do see failures during transportation of data from one cluster to another. For that, we have invented a patent-pending technology which tracks data transportation minutely to make sure data does not disappear while moving from one stage to another. CLICK 5: The main message here is that we never create a ‘full’ cluster; instead, we employ role-based clusters. This has a dramatic effect on cost. We believe enterprise software stacks need to move to role-based clusters if they want 10x savings; otherwise, they still carry the CapEx of allocating hundreds of nodes in the cloud.
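As a concrete sketch of this role-based lifecycle, here is roughly what it might look like with today's boto3 SDK (which postdates this 2010 talk); the AMI ID, bucket, key, and tag names are hypothetical:

```python
# Sketch of a role-based cluster lifecycle: short-lived clusters per stage,
# with the model hibernated to S3 in between. All identifiers hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3")

def run_role_cluster(role, count):
    """Bring up a cluster dedicated to one stage (role) of processing."""
    resp = ec2.run_instances(
        ImageId="ami-12345678",          # hypothetical analytics AMI
        MinCount=count, MaxCount=count,
        InstanceType="m1.large",
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": role}],
        }],
    )
    return [i["InstanceId"] for i in resp["Instances"]]

# Night: cleanse the data and build the model on a temporary cluster ...
loaders = run_role_cluster("load-model", count=10)
# ... run the load workflow, then hibernate the model and release the nodes.
s3.upload_file("/data/model.bin", "jovian-models", "model/2010-06-01.bin")
ec2.terminate_instances(InstanceIds=loaders)

# Morning: a separate, smaller query cluster restores the model.
queriers = run_role_cluster("query", count=5)
s3.download_file("jovian-models", "model/2010-06-01.bin", "/data/model.bin")
```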
  • Almost everybody talks about cloud computing enabling on-demand performance. But how easy is it to dial up performance based on usage? Is it just about adding more nodes to the cluster? How should the data be redistributed? Should it be done blindly? CLICK 1: At JovianDATA, we believe in selective replication to enable performance. Let's see selective replication in action. CLICK 2: Say a customer is running a Site Section report. If the report takes 10 minutes, we will dynamically provision 2 new nodes. But it does not make sense to just copy data over to those nodes. CLICK 3: Instead, we create copies of the hottest partitions on these new nodes. The hotness of data is defined by a statistics package that provides complete visibility into the usage of the cluster. CLICK 4: With selective replication of hot data, the report is generated in 30 seconds rather than 10 minutes. CLICK 5: Statistics and automatic algorithms are important because we have to do this selective replication on terabytes of data.
  • Once the usage goes down, the nodes are returned through terminate-instance and the replicas vanish. Thus, we have increased performance by nearly 20 times without increasing the cost of the analytics stack. The key message here is that blind addition of nodes is not dynamic provisioning; dynamic provisioning should work hand in hand with intelligent replication of data.
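A minimal sketch of this usage-driven replication policy, assuming a simple access log as the statistics source (the real system's statistics package and algorithms are not disclosed; partition and node names follow the example slides):

```python
# Sketch of selective replication: copy only the hottest partitions onto
# newly provisioned temporary nodes, then drop the replicas when the nodes
# are returned. Thresholds and data structures are illustrative only.
from collections import Counter

placement = {                      # partition -> nodes holding a copy
    "P1": ["node1"], "P3": ["node2"], "P12": ["node3"],
    "P22": ["node3"], "P34": ["node4"],
}
access_log = ["P12", "P12", "P34", "P12", "P34", "P1"]   # from query stats

def replicate_hot(temp_nodes, top_k=2):
    """Copy the top_k most-accessed partitions onto the temporary nodes."""
    hot = [p for p, _ in Counter(access_log).most_common(top_k)]
    for node in temp_nodes:
        for p in hot:
            placement[p].append(node)
    return hot

def release(temp_nodes):
    """Terminate-instance time: drop every replica on the temp nodes."""
    for p, nodes in placement.items():
        placement[p] = [n for n in nodes if n not in temp_nodes]

hot = replicate_hot(["temp1", "temp2"])
print(hot, placement)   # P12 and P34 now also live on temp1/temp2
release(["temp1", "temp2"])
```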
  • The next big use case is application isolation. CLICK 1: Consider the use case of a Campaign Manager who needs to run a segmentation report which requires heavy usage of the cluster. The non-cloud option for a customer is to keep the application running 24x7, because you never know when the Campaign Manager needs to run this intense report. The other option is to ‘share’ the application across operational reporting and intense analytics. A 24x7 cluster is characterized by high cost; a shared cluster is characterized by maintenance headaches as applications keep running into each other. CLICK 2: Keeping an application running in a large, shared cluster has a dramatic effect on cost: a 50-node application could cost as much as $14,400 per month. JovianDATA brings a third option to the table, which exploits the cloud economics of cheap storage and elastic provisioning of CPUs.
  • In the scenario we have often seen with our customers, the analyst comes in, say, once every fortnight to do deep analytics. CLICK 1: With JovianDATA, the analyst puts in a request for a cluster, and the cluster gets allocated on demand. CLICK 2: The application is then brought to life with a parallel restore from S3. With our parallel restore, we have seen restore times as low as 30 minutes for a 5-10 TB cluster. CLICK 3: At JovianDATA, we do these restores by running parallel jobs which transfer data from S3 into EC2. The restore takes care of network failures as well as software issues. Once the application is fully restored, the analyst can run their intense analytics in complete isolation from other analysts; once their analysis is done, they can de-provision the cluster. CLICK 4: The cost savings of running on-demand clusters for deep analytics are dramatic. A 24/7 cluster would mean thousands of dollars in recurring expenditure; an on-demand cluster requires hundreds of dollars, and only when the analyst really uses the cluster.
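As an illustration of a parallel, failure-tolerant restore from S3, here is a sketch using boto3 and a thread pool; the bucket name, key layout, shard count, and retry policy are all hypothetical:

```python
# Sketch of a parallel, retrying restore from S3 into a fresh cluster, in
# the spirit of the parallel-restore step described above. Identifiers are
# assumptions for illustration, not JovianDATA's actual implementation.
import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET = "jovian-hibernated-models"          # hypothetical bucket
KEYS = [f"advertiser-42/part-{i:04d}" for i in range(256)]

def fetch(key, attempts=3):
    """Download one model shard, retrying on transient failures."""
    s3 = boto3.client("s3")                  # one client per worker thread
    for attempt in range(1, attempts + 1):
        try:
            s3.download_file(BUCKET, key, f"/restore/{key.replace('/', '_')}")
            return key
        except Exception:
            if attempt == attempts:
                raise                        # surface persistent failures

# Many concurrent streams hide per-object latency; the talk reports ~30
# minutes for a 5-10 TB model with this kind of parallelism.
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(fetch, KEYS))
```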
  • In conclusion, at JovianDATA, we believe three key things are necessary for a ‘real’ cloud computing solution. First, EC2 should be exploited to get near-zero CapEx; if the software is deployed in the classic, CapEx-intensive enterprise model, you are leaving 10x cost savings on the table. Second, EC2's ability to bring up computing in minutes should be exploited to increase performance on demand; if your analytics takes days to improve performance and requires intense DBA intervention, it is not exploiting the cloud. 10x performance should be available in minutes. Third, using S3 for full application hibernation allows EC2 nodes to be de-provisioned; EC2 nodes should be provisioned only when a customer needs to run intense analytics on a periodic basis. Idle nodes are the biggest no-no, and you are leaving 100x cost savings on the table if you are not using S3 effectively.

AWS Customer Presentation - JovianDATA: Presentation Transcript

  • Analytics at the Speed of Thought
2460 North First Street, Suite 170, San Jose, CA 95131 • 408-433-9383 • www.joviandata.com
  • JovianDATA Mission
    Technology platform to optimize your conversion funnel at the lowest cost
  • Why move to the cloud?
    Considering AWS actively
    but not sure about
    • CapEx benefits
    • Current stack’s cloud readiness
    • Application provisioning challenges
  • Introducing JovianDATA
    Ad Server Data, Search Engine Data + Sales/Conversion Data + Site/Web Analytics Data + Customer/3rd Party Data + Other Data Sources
    Actionable Insights: Billions of Impressions, Clicks & Conversions (100’s of TB) • No sampling • Multi-dimensional analytics • In-Flight • Extremely low TCO • Fast Time-to-Value SaaS
  • Transforming Data to Actionable Insights
    Campaign Heat Map (axes: Time, Geography, Publishers; Engagement levels: High / Medium / Low)
    Fully Materialized Data Cube
    • Incremental updates
    • Multi-dimensional indexes
    • Multi-dimensional partitions
  • Agenda
    JovianDATA Company Overview
    JovianInsights – The Power of Analytics
    JovianDATA Cube Storage
    Innovations in Advanced Analytics using commodity clusters
    Analytics Lifecycle Management
    Innovations in Cloud Infrastructure Management
  • Avoiding Expensive Data Processing
    Usage-based automatic view materialization
    Avoid network I/O with multi-dimensional partitioning
    Reduce disk I/O by materializing expensive groups
  • Why move to the cloud?
  • Agenda
    Reducing CapEx
    Dynamic Provisioning
    Application Isolation
  • Managing CapEx with Role Based Clusters
    SINGLE CLUSTER FOR DATA CLEANSING, LOAD AND QUERY
    15TB
    100 NODES
    Monthly Cost = $28,800
  • Managing CapEx with Role Based Clusters
    DATA CLEANSING → LOAD MODEL → HIBERNATE MODEL → QUERY → UI
    Ad Server Data, Search Engine Data
    2 hours daily for load on 10 nodes
    Query on 5 nodes
    Monthly Cost = $2,052
  • Agenda
    Reducing CapEx
    Dynamic Provisioning
    Application Isolation
  • Selective Replication for on-demand performance
    • A power analyst needs to perform complex, heavy number-crunching queries that typically take 8-10 hours
    • Solution: FlexRestore™
    • Adds two new temporary nodes (Temp1, Temp2)
    • Creates new replicas for hot partitions and redistributes them across nodes
    With Replication Factor = 1: Site Section Analytics = 10 minutes (partitions P1, P3, P12, P22, P34 spread across Node1-Node4)
    With Replication Factor = 10: Site Section Analytics = 30 seconds (hot partitions also replicated onto Temp1 and Temp2)
  • Reduce replication to maintain cost
    • When the analysis is done and the extra performance is not needed, the SLA Controller brings down the two temporary nodes (and the extra replicas)
    • Benefits
    • High-performance computing power when you need it
    • But only when you need it, holding down operating costs
    (Temp1 and Temp2 are released; partitions P1, P3, P12, P22, P34 remain distributed across Node1-Node4)
  • Agenda
    Reducing CapEx
    Dynamic Provisioning
    Application Isolation
  • Provision Tera Scale Applications in Minutes
    Without Application Isolation: data for all advertisers is kept ‘live’ on 50 nodes
    Campaign Manager needs to run heavy-duty reports for a Big Advertiser
    50 live nodes per month = $14,400
    FUNNEL ANALYSIS FOR CLIENT
  • Provision Tera Scale Applications in Minutes
    Application is provisioned in parallel from S3/EBS into EC2
    Campaign Manager requests application provisioning for a specific Advertiser
    50 nodes for fortnightly analysis = $320
    FUNNEL ANALYSIS FOR CLIENT
    HIBERNATED MODEL
  • Summary
    Reducing CapEx with Role based Temporary Clusters on EC2
    10x Cost Savings with EC2 usage
    Dynamic Provisioning with Selective Replication on EC2
    10x Performance with EC2 replication
    Application Isolation with Application Hibernation on S3/EBS
    100x Cost Savings with EC2-S3
  • Thank You