Mining Information from Data on Cloud

Understand the Big Data ecosystem on the Cloud and the building blocks that help you build applications for Data Mining and Visualization. Also learn from LatentView Analytics how they built “PanelMiner”, a platform that efficiently transforms unstructured HTML data into structured data to gain insights about consumer behavior from large data sets.

Presenters:
Ganesh Raja, Solutions Architect, Amazon Internet Services

Ganesh Sankaralingam, Head of Delivery (US West Coast), LatentView Analytics

Shrirang Bapat, Vice President – Engineering, PubMatic

Published in: Technology

  1. Bangalore
  2. Mining Information from Data on Cloud. Ganesh Raja, Solutions Architect, Amazon Internet Services
  3. What is Big Data? When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share them. It's tough because of velocity, volume and variety.
  4. 380,000,000 tweets/day; 200,000,000+ new photos/day; processes 1.5M+ log events per second; 80% of data on the WWW is from the last 2 years
  5. The cost of data generation is falling
  6. Generation; collection & storage; analytics & computation; collaboration & sharing. Lower cost, higher throughput; highly constrained
  7. Cloud Computing: elastic & highly scalable + no capital expense + pay-per-use + on-demand $0 = remove constraints
  8. Generation; collection & storage; analytics & computation; collaboration & sharing. Accelerated
  9. Big data and AWS cloud computing (Big Data / AWS Cloud Computing):
     • Variety, volume, and velocity requiring new tools / Variety of compute, storage, and networking options
     • Massive datasets / Massive, virtually unlimited capacity
     • Iterative, experimental style of data manipulation and analysis / Iterative, experimental style of infrastructure deployment/usage
     • Frequently not steady-state workload; peaks and valleys / At its most efficient with highly variable workloads
  10. Big Data Technology: technologies and techniques for working productively with data, at any scale
  11. One tool to rule them all
  12. Big Data & Analytics @ AWS. Stages: collect, store, analyze, share. AWS big data portfolio: Direct Connect, S3, Import/Export, EC2, DynamoDB, Redshift, Glacier, EMR, Data Pipeline, Amazon Kinesis
  13. Amazon S3: store anything; object storage; scalable; 99.999999999% durability
  14. Amazon Kinesis: real-time processing; high throughput; elastic; easy to use; integrations with EMR, S3, Redshift, DynamoDB
  15. Amazon DynamoDB: NoSQL database; seamless scalability; zero admin; single-digit millisecond latency
  16. Amazon Redshift: relational data warehouse; massively parallel; petabyte scale; fully managed; $1,000/TB/year
  17. Amazon Elastic MapReduce: Hadoop/HDFS clusters; Hive, Pig, Impala, HBase; easy to use; fully managed; on-demand and spot pricing; tight integration with S3, DynamoDB, and Kinesis
  18. The right tools. At the right scale. At the right time.
  19. Bangalore
  20. Panel Miner: Data Mining and Visualization using AWS. Ganesh Sankaralingam
  21. LatentView at a Glance:
     • Build Analytics Centers of Excellence (COEs)
     • Analyze business problems both qualitatively and quantitatively and provide actionable insights
     • Onsite-offshore global delivery model that helps in-house teams do more with less
     • Identified as “Cool Vendor” in Analytics by Gartner 2014
     • Won the Deloitte Technology Fast 50 India award for 5 consecutive years (2009 – 13)
     • ‘Top Innovator’ awarded to LatentView by Developer Week (Conference & Festival 2013)
     • Recognized as a global ‘Market Leader’ in the Analytics space by SourcingLine
     • Top Finalist in the ‘We Love Our Workplace 2013’ category
  22. Business Pain Points: required to combine different types of data to make business decisions. Data quadrants (within the firewall vs. outside the firewall): internal structured data, external structured data, internal unstructured data, external unstructured data. Examples: ERP, legacy data; RDBMS or Excel format; email text, customer service notes, Yammer; webserver logs; survey; market research; macroeconomics; promotions; social media, news articles, panel data; real-time visualization of machine logs (IoT)
  23. Technical Pain Points: required to combine different types of data
     • Transform unstructured data into structured data queried using SQL statements
     • Automated scalable framework to process > 500K small files in constant time
     • Achieve high efficiency converting unstructured data to structured data
     • Control security and access for different business users
     • Minimize the cost and time of running distributed jobs
     • Store and retrieve data for analysis in a cost- and time-efficient manner
     • Track various processes in the AWS platform
  24. Why AWS? Cost of ownership; scalable, easy to use; easy to acquire additional machines based on needs; petabyte-level scalability (1,000,000,000,000,000 bytes); data security; high availability; technology breadth and technical support
  25. PANEL MINER: converting unstructured to structured data using AWS infrastructure. Unstructured data; data collection (EC2, S3): download, extract, clean and stage data for processing; Python parser to convert unstructured data into structured data; EMR: Hadoop-optimized data processing; Redshift: data warehousing and reporting; structured data; analysis using Excel, Tableau and other visualization tools
  26. Key Benefits and Learnings with AWS
  27. Bangalore
  28. Analytics in the Cloud: Leverage AWS to scale Big Data Analytics. Shrirang Bapat, VP Engineering, PubMatic
  29. Your Speaker Today: Shrirang Bapat, VP Engineering at PubMatic. Data enthusiast; innovation agent; agile evangelist
  30. One Platform: Multi-Format, Multi-Screen, Multi-Channel
     • Every Ad: IAB standard banners; IAB Rising Stars; native and custom units; rich media: MRAID 1 & 2, ORMMA, interstitial; video: VAST, VPAID
     • Every Screen: mobile applications; tablet applications; mobile & tablet optimized web; desktop web
     • Every Sales Channel: direct sales integration; programmatic direct (private marketplace, automated guaranteed); open auction; spot-buys
  31. Premium at Scale, Across All Buying Channels (Programmatic Direct Channel: Definition; Value)
     • Automated Guaranteed: direct bought guaranteed inventory access, non-RTB; predictable and scalable high value placements
     • Open Market: RTB based inventory access in open marketplace; efficient and targeted audience buying
     • Private Marketplace: direct bought RTB based inventory access; controlled buying with price agreements for bids
  32. PubMatic is the Only Publisher-Focused Software Platform at Scale: 94.5% U.S. reach, larger than Google (comScore March 2014); industry’s best results, independent & flexible; 5 data centers, 4 trillion RTB requests monthly; 500+ people doing business in 30 countries
  33. 5 PB; 4,000,000,000,000+ bids; 6 AWS regions; 350,000,000,000+ impressions
  34. New Architecture: real-time slice and dice; hyper growth; older infrastructure; time to market
  35. PubMatic on AWS: big data pipeline; real time; EMR and HBase; ad serving
  36. AWS Services: storage; database; stream processing; compute; EMR; networking; monitoring; DNS
  37. If you only take away 3 things: ease of use; reduced time to market; DevOps
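
Slide 25 mentions a Python parser that converts unstructured HTML data into structured data, but the deck does not show it. The following is only a minimal sketch of that idea using the Python standard library, not LatentView's implementation. The sample HTML, the column names (`panelist_id`, `product`, `price`) and the helper names `TableRowParser`, `html_to_records` and `records_to_csv` are all hypothetical and chosen for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell; markup between cells is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_to_records(html):
    """Return one dict per table row, keyed by the first (header) row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records):
    """Serialize records as CSV text, a flat format a warehouse can load."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical input: a small HTML table standing in for scraped panel data.
SAMPLE_HTML = """
<table>
  <tr><th>panelist_id</th><th>product</th><th>price</th></tr>
  <tr><td>p-001</td><td>Coffee <b>beans</b></td><td>12.50</td></tr>
  <tr><td>p-002</td><td>Green tea</td><td>4.25</td></tr>
</table>
"""

if __name__ == "__main__":
    print(records_to_csv(html_to_records(SAMPLE_HTML)), end="")
```

In a pipeline like the one on slide 25, a step of this kind would presumably run per file after staging, with the resulting delimited output written back to storage for the warehouse load. Real panel pages are rarely this regular, so a production parser would need per-source extraction rules and error handling that this sketch leaves out.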
