The architecture of data analytics PaaS on AWS

JAWS Days 2nd day: Treasure Data presentation.

http://jaws-ug.jp/jawsdays2013/speaker.html#DEV-01

Ustream: http://www.ustream.tv/recorded/30009634

Published in: Technology

Transcript

  • 1. Treasure Data: The architecture of data analytics PaaS on AWS. Masahiro Nakagawa. JAWS Days: 2013/03/16. Friday, April 5, 13
  • 2. Who are you? Masahiro Nakagawa: @repeatedly / masa@treasure-data.com. Treasure Data, Inc.: Senior Software Engineer, since 2012/11. Open source projects: D Programming Language; MessagePack (D, Python, etc.); Fluentd (core, mongo, etc.); etc.
  • 3. Introduction to Treasure Data
  • 4. Company Overview. Silicon Valley-based company; all founders are Japanese: Hironobu Yoshikawa, Kazuki Ohta, Sadayuki Furuhashi. OSS enthusiasts: MessagePack, Fluentd, etc.
  • 5. Investors: Bill Tai; Naren Gupta (Nexus Ventures, Director of Red Hat, TIBCO); Othman Laraki (former VP Growth at Twitter); James Lindenbaum, Adam Wiggins, Orion Henry (Heroku founders); Anand Babu Periasamy, Hitesh Chellani (Gluster founders); Yukihiro “Matz” Matsumoto (creator of Ruby); Dan Scheinman (Director of Arista Networks); Jerry Yang (founder of Yahoo!); + 10 more people, and....
  • 6. Treasure Data = Cloud + Big Data. [Figure: market map, © 2012 Forrester Research, Inc. Reproduction Prohibited. Labels in extraction order: Cloud; Big Data-as-a-Service; Database-as-a-service; Enterprise; Lightweight RDBMS; Traditional RDBMS; Data Warehouse; DB2; On-Premise; $34B market; $10B market. Axis: Data Volume, with a threshold at 1Bil entry or 10TB.]
  • 7. Why Cloud? ‘Time’ is Money. [Figure: value over time, starting at sign-up or PO. Ideal: customer expectation. Reality (on-premise): HW/SW selection, PoC, deploy...; upgrade; obsolete over time.]
  • 8. Big Data Adoption Stages. [Figure: stages plotted by intelligence and sophistication. Reporting: Standard Reports (What happened?); Ad-hoc Reports (Where?); Drill Down Query (Where exactly?); Alerts (Error?). Analytics: Statistical Analysis (Why?); Predictive Analysis (What’s a trend?); Optimization (What’s the best?). Annotation: Treasure Data’s FOCUS (80% of needs).]
  • 9. Full Stack Support for Big Data Reporting. Our best-in-class architecture and operations team ensure the integrity and availability of your data. Data from almost any source can be securely and reliably uploaded using td-agent in streaming or batch mode. Our SQL, REST, JDBC, ODBC and command-line interfaces support all major query tools and approaches. You can store gigabytes to petabytes of data efficiently and securely in our cloud-based columnar datastore.
  • 10. Vision: Single Analytics Platform for the World
  • 11. Our Customers: Fortune Global 500 leaders and start-ups (customer logos not captured in the transcript).
  • 12. Treasure Data’s Service Architecture
  • 13. Treasure Data = Collect + Store + Query
  • 14. Example in AdTech: MobFox. 1. Europe’s largest independent mobile ad exchange. 2. 20 billion impressions/month (circa Jan. 2013). 3. Serving ads for 15,000+ mobile apps (circa Jan. 2013). 4. Needed Big Data analytics infrastructure ASAP.
  • 15. Two Weeks From Start to Finish!
  • 16. Used AWS Products (1). RDS: stores user information, job status, etc.; stores metadata of our columnar database; queue of workers (perfectqueue / perfectsched). EC2: API servers; Hadoop clusters; job workers; deployed using Chef.
  • 17. Used AWS Products (2). ELB: load balancing of API servers; load balancing of td-agents. S3: columnar storage built on top of S3; MessagePack columnar format; realtime / archive storage; our Result feature supports S3 output. No EMR, SQS, or other products!
  • 18. Architecture Breakdown. Data Collection: increasing variety of data sources; no single data schema; lack of streaming data collection method; 60% of Big Data project resources consumed. Data Store/Analytics: remaining complexity in both traditional DWH and Hadoop (very slow time to market); challenges in scaling data volume and expanding cost. Connectivity: required to ensure connectivity with existing BI/visualization/apps by JDBC, REST and ODBC; output to other services, e.g. S3, RDBMS, etc.
  • 19. 1) Data Collection. 60% of BI project resources are consumed here. Most ‘underestimated’ and ‘unsexy’ but MOST important. Fluentd: OSS lightweight but robust log collector (http://fluentd.org/).
  • 20. Fluentd: the missing log collector. fluentd.org
  • 21. In short: open-source log collector written in Ruby; uses the rubygems ecosystem for plugins. It’s like syslogd, but uses JSON for log messages.
  • 22. [Figure: Apache writes access-log lines such as `127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...`; Fluentd tails the log, buffers the events, and inserts them into Mongo. Each event consists of a Time (2012-02-04 01:33:51), a Tag (apache.log), and a Record ({ "host": "127.0.0.1", "method": "GET", "path": "/", ... }).]
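
The time / tag / record structure on this slide can be illustrated with a small sketch. This is not Fluentd's own parser (Fluentd is written in Ruby); the regular expression and the `to_event` helper are hypothetical and only cover the abbreviated log lines shown on the slide.

```python
import re
from datetime import datetime

# Hypothetical pattern for the abbreviated lines on the slide; real Apache
# log formats carry more fields (status, size, referer, ...).
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def to_event(line, tag="apache.log"):
    """Turn one access-log line into a (tag, time, record) tuple, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    time = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S")
    record = {
        "host": m.group("host"),
        "method": m.group("method"),
        "path": m.group("path"),
    }
    return tag, time, record


event = to_event('127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...')
```

The record is a plain dictionary, so it serializes directly to the JSON form the slide shows.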
  • 23. Architecture. Pluggable Input: Forward, HTTP, File tail, dstat, ... Pluggable Buffer: Memory, File. Pluggable Output: Forward, File, Amazon S3, MongoDB, ...
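
A toy version of the input, buffer, and output chain might look like the following. All names here (`Pipeline`, `emit`, `flush_size`) are made up for illustration and do not mirror Fluentd's plugin API; the point is only that the output is swappable and that events are handed over in buffered chunks.

```python
from collections import deque


class Pipeline:
    """Toy input -> buffer -> output chain; every name here is hypothetical."""

    def __init__(self, output, flush_size=3):
        self.buffer = deque()        # stands in for a memory buffer plugin
        self.output = output         # any callable that accepts a list of events
        self.flush_size = flush_size

    def emit(self, tag, record):
        """Called by an input plugin for each incoming event."""
        self.buffer.append((tag, record))
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        """Hand the buffered chunk to the output and clear the buffer."""
        if self.buffer:
            chunk = list(self.buffer)
            self.buffer.clear()
            self.output(chunk)


written = []                         # stands in for a file, S3, or MongoDB output
pipe = Pipeline(output=written.extend, flush_size=3)
for i in range(4):
    pipe.emit("apache.log", {"n": i})
pipe.flush()                         # flush the remaining event
```

Swapping `written.extend` for another callable changes the destination without touching the input side, which is the property the slide's three columns describe.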
  • 24. Before Fluentd. [Figure: applications on Server1, Server2, and Server3 send their logs to a central log server.] High latency! Must wait for a day...
  • 25. After Fluentd. [Figure: applications on Server1, Server2, and Server3 each write to a local Fluentd, which forwards to downstream Fluentd instances.] In streaming!
  • 26. [Figure: Fluentd (filter / buffer / routing) sits between sources and destinations. Sources: access logs (Apache); app logs (Frontend, Backend); system logs (syslogd); databases. Destinations: alerting (Nagios); analysis (MongoDB, MySQL, Hadoop); archiving (Amazon S3).]
  • 27. td-agent. Open-source distribution package of fluentd. ETL part of Treasure Data. Includes useful components: ruby, jemalloc, fluentd; 3rd party gems: td, mongo, webhdfs, etc. (the td plugin is for Treasure Data). http://packages.treasure-data.com/
  • 28. Treasure Data Service Architecture. [Figure: data sources (Apache, App, App, RDBMS, other data sources) feed td-agent (marked "This!"), which uploads to the Treasure Data columnar data warehouse. A query processing cluster runs MapReduce jobs (Hive; Pig to be supported) behind the Query API, which users and BI apps reach through td-command, JDBC, and REST.]
  • 29. AWS plugins: S3, SNS, SQS, DynamoDB, forward-aws, RDS, Redshift, CloudWatch, Yet Another Cloud Watch, CloudWatch Lite. http://fluentd.org/plugin/
  • 30. 2) Data Store / Analytics: Columnar Storage
  • 31. Treasure Data Service Processing Flow. [Figure: Worker, Frontend, Job Queue, and Hadoop nodes. Applications push metrics to Fluentd (via local Fluentd). Fluentd "sums up data minutes" (partial aggregation). Results go to Treasure Data for historical analysis and to Librato Metrics for realtime analysis.]
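
Partial aggregation can be sketched as follows, assuming the slide means that metrics are summed per minute before being forwarded. The event shape `(epoch_seconds, name, value)` and the function name are hypothetical.

```python
from collections import defaultdict


def aggregate_per_minute(events):
    """Sum metric values per (minute, name).

    Each event is an (epoch_seconds, name, value) tuple; this shape is an
    assumption made for the sketch, not something the slides specify.
    """
    totals = defaultdict(float)
    for ts, name, value in events:
        minute = ts - ts % 60        # truncate the timestamp to its minute
        totals[(minute, name)] += value
    return dict(totals)


events = [(120, "jobs", 1), (130, "jobs", 2), (185, "jobs", 4), (125, "errors", 1)]
summary = aggregate_per_minute(events)
```

Four raw events collapse into three summary rows here; on real traffic the reduction would be much larger, which is the reason to aggregate before sending data downstream.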
  • 32. (No text captured for this slide.)
  • 33. Structure of Columnar Storages. [Figure labels: import, Import Storage; bulk import, Bulk Import Storage; Realtime Storage; Archive Storage; merge (every 1 hour); SELECT ... Sample entries shown: 23c82b0ba3405d4c15aa85d2190e, 2013-03-15 00:23:00, 912ec80; 6d7b1482412ab14f0332b8aee119, 2013-03-16 00:01:00, 277a259; 8a7bc848b2791b8fd603c719e54f ...; 0e3d402b17638477c9a7977e7dab ...]
  • 34. Layers: Query Language; Query Execution; Columnar Data; Object Storage.
  • 35. 1/4: Compile SQL into MapReduce. SQL statement: SELECT COUNT(DISTINCT ip) FROM tbl; Hive: SQL-to-MapReduce.
  • 36. 2/4: MapReduce is executed in parallel. SELECT COUNT(DISTINCT ip) FROM tbl; cc2.8xlarge cluster compute instances (up to 100 nodes * 32 threads).
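
The idea behind running `COUNT(DISTINCT ip)` as a parallel map step followed by a reduce step can be sketched like this. It is a simplified illustration, not the plan Hive actually generates; the partition layout and function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def map_partition(rows):
    """Map step: reduce one partition to its set of distinct ips."""
    return {row["ip"] for row in rows}


def count_distinct_ip(partitions, workers=4):
    """Run the map step in parallel, then union the partial sets (reduce step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sets = list(pool.map(map_partition, partitions))
    distinct = set()
    for partial in partial_sets:
        distinct |= partial
    return len(distinct)


partitions = [
    [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}],
    [{"ip": "10.0.0.2"}, {"ip": "10.0.0.3"}],
]
total = count_distinct_ip(partitions)
```

Deduplicating inside each partition first keeps the data passed to the reduce step small, which matters when the partitions live on many nodes.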
  • 37. 3/4: Columnar Data Access. SELECT COUNT(DISTINCT ip) FROM tbl; 10Gbps network. Read ONLY the required part of the data.
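
Why a columnar layout lets the query read only the required part can be shown with an in-memory sketch. The sample rows and the `to_columns` helper are invented for illustration; the actual storage is a MessagePack columnar format on S3, which this does not attempt to reproduce.

```python
# Row-oriented layout: answering any query means touching whole rows.
rows = [
    {"ip": "10.0.0.1", "path": "/", "agent": "curl"},
    {"ip": "10.0.0.2", "path": "/a", "agent": "firefox"},
    {"ip": "10.0.0.1", "path": "/b", "agent": "curl"},
]


def to_columns(rows):
    """Pivot rows into one list per column name (assumes uniform keys)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# SELECT COUNT(DISTINCT ip) FROM tbl needs only the "ip" column;
# the "path" and "agent" columns are never read.
distinct_ips = len(set(columns["ip"]))
```

When each column is stored as a separate object or byte range, skipping the unused columns also means skipping their network transfer.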
  • 38. 4/4: Object-based Storage
  • 39. Data first, Schema later. Raw data (JSON): {"user":54, "name":"test", "value":"120", "host":"local"}. Schema: user:int, name:string, value:int, host:int. SELECT returns: 54 (int), "test" (string), 120 (int), NULL.
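
The schema-on-read behaviour on this slide (the string "120" becomes the integer 120, and "local" becomes NULL because it cannot be read as an int) can be mimicked with a short sketch. The `apply_schema` function and `CASTS` table are hypothetical, and the rule that failed casts yield `None` is inferred from the slide's example.

```python
import json

# Type names from the slide mapped to Python casts.
CASTS = {"int": int, "string": str}


def apply_schema(raw_json, schema):
    """Cast each field to its declared type; uncastable values become None."""
    record = json.loads(raw_json)
    row = {}
    for name, type_name in schema.items():
        try:
            row[name] = CASTS[type_name](record[name])
        except (KeyError, ValueError, TypeError):
            row[name] = None
    return row


raw = '{"user": 54, "name": "test", "value": "120", "host": "local"}'
schema = {"user": "int", "name": "string", "value": "int", "host": "int"}
row = apply_schema(raw, schema)
```

Because the raw JSON is stored untouched, changing `host` to `string` in the schema later would recover "local" without re-importing any data.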
  • 40. 3) Connectivity. [Figure: td-command, the REST API, and the JDBC/ODBC driver (BI apps) send queries to the Query API, which runs them on the query processing cluster over Treasure Data columnar storage. Results are output to a web app, MySQL, S3, …]
  • 41. Multi-Tenancy. All customers share the Hadoop clusters (multiple data centers). Resource sharing (burst cores), rapid improvement, ease of upgrade. [Figure: job submission and plan changes go to a Global Scheduler, which dispatches to a Local FairScheduler in each of datacenters A, B, C, and D; on-demand resource allocation.]
  • 42. Conclusion. Treasure Data: cloud-based big data analytics platform; provides a machete for big data reporting. Big Data processing: Collect / Store / Analytics / Visualization (annotation: Our focus!). AWS products we use: EC2, S3, RDS, ELB; building Treasure Data-specific systems on AWS.
  • 43. Big Data for the Rest of Us. www.treasure-data.com | @TreasureData