The architecture of data analytics PaaS on AWS

JAWS Days 2nd day: Treasure Data presentation.

http://jaws-ug.jp/jawsdays2013/speaker.html#DEV-01

Ustream: http://www.ustream.tv/recorded/30009634

Published in: Technology

  1. 1. Treasure Data: The architecture of data analytics PaaS on AWS. Masahiro Nakagawa. JAWS Days: 2013/03/16. Friday, April 5, 13
  2. 2. Who are you? Masahiro Nakagawa (@repeatedly / masa@treasure-data.com). Treasure Data, Inc.: Senior Software Engineer, since 2012/11. Open Source projects: D Programming Language; MessagePack (D, Python, etc.); Fluentd (core, mongo, etc.); and others.
  3. 3. Introduction to Treasure Data
  4. 4. Company Overview. Silicon Valley-based company: all founders are Japanese (Hironobu Yoshikawa, Kazuki Ohta, Sadayuki Furuhashi). OSS enthusiasts: MessagePack, Fluentd, etc.
  5. 5. Investors: Bill Tai; Naren Gupta (Nexus Ventures, Director of Red Hat, TIBCO); Othman Laraki (former VP Growth at Twitter); James Lindenbaum, Adam Wiggins, Orion Henry (Heroku founders); Anand Babu Periasamy, Hitesh Chellani (Gluster founders); Yukihiro “Matz” Matsumoto (creator of Ruby); Dan Scheinman (Director of Arista Networks); Jerry Yang (founder of Yahoo!); plus 10 more people, and....
  6. 6. Treasure Data = Cloud + Big Data. [Figure: market map with deployment (Cloud vs. On-Premise) on one axis and data volume (threshold: 1 billion entries or 10 TB) on the other, placing Big Data-as-a-Service, Database-as-a-service, Lightweight RDBMS, Traditional RDBMS, DB2 and Enterprise Data Warehouse; two markets are labeled $34B and $10B. © 2012 Forrester Research, Inc. Reproduction Prohibited]
  7. 7. Why Cloud? ‘Time’ is Money. [Chart: value over time, starting at sign-up or PO. Curves: Ideal; Customer Expectation; Reality (On-Premise), which is delayed by HW/SW selection, PoC, deploy..., needs upgrades, and becomes obsolete over time.]
  8. 8. Big Data Adoption Stages. [Chart: intelligence against sophistication.] Reporting: Standard Reports (What happened?); Ad-hoc Reports (Where?); Drill Down Query (Where exactly?); Alerts (Error?). Analytics: Statistical Analysis (Why?); Predictive Analysis (What’s a trend?); Optimization (What’s the best?). Treasure Data’s FOCUS: Reporting (80% of needs).
  9. 9. Full Stack Support for Big Data Reporting. Our best-in-class architecture and operations team ensure the integrity and availability of your data. Data from almost any source can be securely and reliably uploaded using td-agent in streaming or batch mode. Our SQL, REST, JDBC, ODBC and command-line interfaces support all major query tools and approaches. You can store gigabytes to petabytes of data efficiently and securely in our cloud-based columnar datastore.
  10. 10. Vision: Single Analytics Platform for the World
  11. 11. Our Customers – Fortune Global 500 leaders and start-ups including:
  12. 12. Treasure Data’s Service Architecture
  13. 13. Treasure Data = Collect + Store + Query
  14. 14. Example in AdTech: MobFox. 1. Europe’s largest independent mobile ad exchange. 2. 20 billion impressions/month (circa Jan. 2013). 3. Serving ads for 15,000+ mobile apps (circa Jan. 2013). 4. Needed Big Data Analytics infrastructure ASAP.
  15. 15. Two Weeks From Start to Finish!
  16. 16. Used AWS Products (1). RDS: stores user information, job status, etc.; stores metadata of our columnar database; queue of workers (perfectqueue / perfectsched). EC2: API servers; Hadoop clusters; job workers; deployed using Chef.
  17. 17. Used AWS Products (2). ELB: load balancing of API servers; load balancing of td-agents. S3: columnar storage built on top of S3; MessagePack columnar format; realtime / archive storage; our Result feature supports S3 output. No EMR, SQS or other products!
  18. 18. Architecture Breakdown. Data Collection: increasing variety of data sources; no single data schema; lack of streaming data collection method; 60% of Big Data project resources consumed. Data Store/Analytics: remaining complexity in both traditional DWH and Hadoop (very slow time to market); challenges in scaling data volume and expanding cost. Connectivity: required to ensure connectivity with existing BI/visualization/apps by JDBC, REST and ODBC; output to other services, e.g. S3, RDBMS, etc.
  19. 19. 1) Data Collection. 60% of BI project resources are consumed here. Most ‘underestimated’ and ‘unsexy’ but MOST important. Fluentd: OSS log collector, lightweight but robust (http://fluentd.org/).
  20. 20. Fluentd: the missing log collector. fluentd.org
  21. 21. In short: open-source log collector written in Ruby; uses the rubygems ecosystem for plugins. It’s like syslogd, but uses JSON for log messages.
  22. 22. [Diagram: Apache writes access-log lines; Fluentd tails the file, buffers events, and inserts them into Mongo.] Each event has a Time (2012-02-04 01:33:51), a Tag (apache.log) and a Record ({ "host": "127.0.0.1", "method": "GET", "path": "/", ... }). Sample log lines: 127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...; 127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...; 127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...; 127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...; 127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
  23. 23. Architecture. Pluggable Input: Forward, HTTP, File tail, dstat, ... Pluggable Buffer: Memory, File. Pluggable Output: Forward, File, Amazon S3, MongoDB, ...
  24. 24. Before Fluentd. [Diagram: applications on Server1, Server2 and Server3 deliver logs to a box labeled “Fluent Log Server”.] High latency! Must wait for a day...
  25. 25. After Fluentd. [Diagram: Server1, Server2 and Server3 each run an application with a local Fluentd, which forwards to further Fluentd nodes.] In streaming!
  26. 26. [Diagram: Fluentd as a filter / buffer / routing hub. Inputs: access logs (Apache); app logs (frontend, backend); system logs (syslogd); databases. Outputs: alerting (Nagios); analysis (MongoDB, MySQL, Hadoop); archiving (Amazon S3).]
  27. 27. td-agent. Open-source distribution package of fluentd. The ETL part of Treasure Data. Includes useful components: ruby, jemalloc, fluentd; 3rd-party gems (td, mongo, webhdfs, etc.); the td plugin is for Treasure Data. http://packages.treasure-data.com/
  28. 28. Treasure Data Service Architecture. [Diagram: Apache, apps, RDBMS and other data sources feed td-agent (marked “This!”), which uploads to the Treasure Data columnar data warehouse. MapReduce jobs (Hive; Pig to be supported) run on the Query Processing Cluster behind the Query API. Users reach it through td-command; BI apps through JDBC, REST.]
  29. 29. AWS plugins (http://fluentd.org/plugin/): S3, SNS, SQS, DynamoDB, forward-aws, RDS, Redshift, CloudWatch, Yet Another Cloud Watch, CloudWatch Lite.
  30. 30. 2) Data Store / Analytics - Columnar Storage
  31. 31. Treasure Data Service Processing Flow. [Diagram: Frontend, Job Queue, Worker and Hadoop components.] Applications push metrics to Fluentd (via local Fluentd). Fluentd sums up data over minutes (partial aggregation) and sends it to Treasure Data for historical analysis and to Librato Metrics for realtime analysis.
  32. 32. [No text on this slide]
  33. 33. Structure of Columnar Storages. [Diagram: “import” goes to Import Storage and “bulk import” to Bulk Import Storage; Realtime Storage and Archive Storage are joined by a merge (every 1 hour); SELECT ... reads the stored data. Sample entries: 23c82b0ba3405d4c15aa85d2190e, 2013-03-15 00:23:00, 912ec80; 6d7b1482412ab14f0332b8aee119, 2013-03-16 00:01:00, 277a259; 8a7bc848b2791b8fd603c719e54f, ...; 0e3d402b17638477c9a7977e7dab, ...]
  34. 34. [Layer diagram, top to bottom: Query Language; Query Execution; Columnar Data; Object Storage.]
  35. 35. 1/4: Compile SQL into MapReduce. SQL statement: SELECT COUNT(DISTINCT ip) FROM tbl; Hive: SQL-to-MapReduce.
  36. 36. 2/4: MapReduce is executed in parallel. SELECT COUNT(DISTINCT ip) FROM tbl; cc2.8xlarge cluster compute instances (up to 100 nodes * 32 threads).
  37. 37. 3/4: Columnar Data Access. SELECT COUNT(DISTINCT ip) FROM tbl; 10Gbps network. Read ONLY the required part of the data.
  38. 38. 4/4: Object-based Storage
  39. 39. Data first, Schema later. Raw data (JSON): {"user":54, "name":"test", "value":"120", "host":"local"}. Schema: user:int, name:string, value:int, host:int. SELECT result: 54 (int), "test" (string), 120 (int), NULL.
  40. 40. 3) Connectivity. [Diagram: td-command (REST API) and BI apps (JDBC, ODBC driver) send queries to the Query API, which runs them on the Query Processing Cluster over Treasure Data Columnar Storage. Results go to a web app, MySQL, S3, …]
  41. 41. Multi-Tenancy. All customers share the Hadoop clusters (multiple data centers). Resource sharing (burst cores), rapid improvement, ease of upgrade. [Diagram: job submission + plan change goes to a Global Scheduler, which performs on-demand resource allocation across the Local FairSchedulers of datacenters A, B, C and D.]
  42. 42. Conclusion. Treasure Data: cloud-based big data analytics platform; provides a machete for big data reporting. Big Data processing: Collect / Store / Analytics / Visualization. Our focus! AWS products we use: EC2, S3, RDS, ELB; building Treasure Data-specific systems on AWS.
  43. 43. Big Data for the Rest of Us. www.treasure-data.com | @TreasureData
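
Slide 22 shows Fluentd tailing an Apache access log and turning each line into an event made of a time, a tag, and a JSON-like record. The sketch below illustrates that conversion in Python. It is not Fluentd's own code (Fluentd is written in Ruby); the regular expression, the `parse_line` helper, and the completed request line in the usage example are assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern for the abbreviated access-log lines on slide 22:
# host, two ignored fields, [timestamp], then "METHOD path ...
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def parse_line(line, tag="apache.log"):
    """Turn one log line into a (time, tag, record) event, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    time = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S")
    record = {
        "host": match.group("host"),
        "method": match.group("method"),
        "path": match.group("path"),
    }
    return time, tag, record


# The slide truncates the request line; the tail of this sample is made up.
event = parse_line('127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / HTTP/1.1" 200')
```

Keeping the record as a plain dictionary mirrors the slide's point that log messages are structured (JSON) rather than free text, so any output can consume them without re-parsing.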
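
Slide 23 describes Fluentd's architecture as pluggable input, buffer, and output stages. A minimal sketch of that three-stage shape follows. The class and function names (`MemoryBuffer`, `ListOutput`, `run_pipeline`) and the chunk-size rule are hypothetical and do not reflect Fluentd's actual plugin API.

```python
class MemoryBuffer:
    """Hypothetical in-memory buffer that groups events into fixed-size chunks."""

    def __init__(self, chunk_limit):
        self.chunk_limit = chunk_limit
        self.events = []

    def append(self, event):
        """Store an event; return True when a full chunk is ready to flush."""
        self.events.append(event)
        return len(self.events) >= self.chunk_limit

    def take_chunk(self):
        """Hand over all buffered events and start an empty chunk."""
        chunk, self.events = self.events, []
        return chunk


class ListOutput:
    """Hypothetical output that records every chunk it is asked to write."""

    def __init__(self):
        self.chunks = []

    def write(self, chunk):
        self.chunks.append(chunk)


def run_pipeline(source, buffer, output):
    """Read events from any iterable input, buffer them, and flush full chunks."""
    for event in source:
        if buffer.append(event):
            output.write(buffer.take_chunk())
    remaining = buffer.take_chunk()
    if remaining:
        output.write(remaining)  # flush the final partial chunk


out = ListOutput()
run_pipeline(range(5), MemoryBuffer(chunk_limit=2), out)
```

Because each stage only depends on a small interface (an iterable input, `append`/`take_chunk`, `write`), any stage can be swapped independently, which is the property the slide's "pluggable" labels point at.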
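
Slides 35 to 37 say that Hive compiles `SELECT COUNT(DISTINCT ip) FROM tbl;` into MapReduce and that only the required column is read. The sketch below shows one way such a query can be phrased as map, shuffle, and reduce steps. It is an illustration only, not the plan Hive actually generates; the sample rows and all function names are assumptions.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (ip, 1) for every row; only the ip column is touched."""
    for row in rows:
        yield row["ip"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Each distinct ip forms exactly one group, so counting groups gives the answer."""
    return sum(1 for _ in groups)


def count_distinct_ip(rows):
    return reduce_phase(shuffle(map_phase(rows)))


# Hypothetical table contents.
rows = [
    {"ip": "10.0.0.1", "path": "/"},
    {"ip": "10.0.0.2", "path": "/a"},
    {"ip": "10.0.0.1", "path": "/b"},
]
distinct = count_distinct_ip(rows)
```

In a real cluster the map calls run in parallel over separate chunks of the table, and the shuffle is what brings equal keys together across nodes.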
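
Slide 39 ("Data first, Schema later") stores raw JSON and applies the schema only when the data is read: the string "120" comes back as an int, and "local" under an int column comes back as NULL. A minimal sketch of that read-time casting follows. The `apply_schema` helper and the rule "a failed cast becomes None" are assumptions chosen to reproduce the slide's example, not Treasure Data's documented conversion rules.

```python
import json

# Hypothetical mapping from schema type names to Python casts.
CASTS = {"int": int, "string": str}


def apply_schema(raw_json, schema):
    """Cast each stored value to its column type at read time; failed casts become None."""
    record = json.loads(raw_json)
    row = {}
    for column, type_name in schema.items():
        value = record.get(column)
        try:
            row[column] = None if value is None else CASTS[type_name](value)
        except (TypeError, ValueError):
            row[column] = None  # e.g. "local" cannot be read as an int
    return row


schema = {"user": "int", "name": "string", "value": "int", "host": "int"}
raw = '{"user": 54, "name": "test", "value": "120", "host": "local"}'
row = apply_schema(raw, schema)
```

Since the raw record is never rewritten, changing the schema later (for example, declaring `host` as a string) only changes how the same stored data is read.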