Treasure Data Cloud Strategy


  1. 1. Treasure Data Cloud Strategy Masahiro Nakagawa July Tech Festa: Jul 14, 2013 Sunday, July 14, 13
  2. 2. Who are you? § Masahiro Nakagawa • @repeatedly / masa@treasure-data.com § Treasure Data, Inc. • Senior Software Engineer, since 2012/11 § Open Source projects • D Programming Language • MessagePack: D, Python, etc... • Fluentd: Core, Mongo, Logger, etc... • etc...
  3. 3. Treasure Data overview
  4. 4. Company Overview § Silicon Valley-based Company • All Founders are Japanese • Hironobu Yoshikawa • Kazuki Ohta • Sadayuki Furuhashi • About 20 people • Over 3.5 million jobs § OSS Enthusiasts • MessagePack, Fluentd, etc.
  5. 5. Investors § Bill Tai § Othman Laraki - Former VP Growth at Twitter § James Lindenbaum, Adam Wiggins, Orion Henry - Heroku Founders § Anand Babu Periasamy, Hitesh Chellani - Gluster Founders § Yukihiro “Matz” Matsumoto - Creator of Ruby § Dan Scheinman - Director of Arista Networks § Jerry Yang - Founder of Yahoo!
  6. 6. Treasure Data = Cloud + Big Data [Figure: market map. Labels: Data Volume (1 bil entries or 10 TB), On-Premise, Cloud, Enterprise RDBMS, Lightweight RDBMS, DB2, Traditional Data Warehouse, Database-as-a-service, Big Data-as-a-Service. Market sizes shown: $10B, $34B. © 2012 Forrester Research, Inc. Reproduction Prohibited]
  7. 7. The Problem with Other Solutions [Figure: Customer Value over Time, starting at Sign-up or PO.] On-Premise Solutions: obsolescence over time, need upgrade. AWS (or hosted Hadoops; EC2, EMR, RedShift, S3): step-by-step manual integrations. Other labels: Maintain, NO Specialists, TOO LONG to get Live, = Complex Solutions + Data Collection +. Treasure Data: fully integrated Big Data full-stack service with simple interface, low friction initial engagement & continuous technical upgrade.
  8. 8. Big Data Adoption Stages [Figure: Intelligence vs. Sophistication.] Reporting: Standard Reports (What happened?), Ad-hoc Reports (Where?), Drill Down Query (Where exactly?), Alerts (Error?). Analytics: Statistical Analysis (Why?), Predictive Analysis (What’s a trend?), Optimization (What’s the best?).
  9. 9. Big Data Adoption Stages (same figure as slide 8) with the added highlight: Treasure Data’s FOCUS (80% of needs)
  10. 10. Full Stack Support for Big Data Reporting Our best-in-class architecture and operations team ensure the integrity and availability of your data. Data from almost any source can be securely and reliably uploaded using td-agent in streaming or batch mode. Our SQL, REST, JDBC, ODBC and command-line interfaces support all major query tools and approaches. You can store gigabytes to petabytes of data efficiently and securely in our cloud-based columnar datastore.
  11. 11. We are... Big Data as a Service, not Hadoop on Cloud
  12. 12. Product. Value Proposition: “Time-to-Answer”. Data Collection: sources are Web Log, App Log, Sensor, RDBMS, CRM, ERP. Streaming Upload (60 billion / month) through the Open-Source Log Collector (2,500+ companies, incl. LinkedIn, etc.). Bulk Upload / Parallel Upload through the Bulk Loader (CSV / TSV, MySQL, Postgres, Oracle, etc.). Data Warehouse: Columnar Storage + Hadoop MapReduce, 600 bil+ records, 3.5 mil+ jobs. Data Analysis: SQL (HiveQL) or Pig; REST, JDBC / ODBC; BI Tools (Tableau, QlikView, Pentaho, Excel, etc.); Dashboard; Result push to Custom App, RDBMS, FTP, etc. Customer examples shown: 20bil+, 2 weeks, UK/Austria; 3bil+, 3 weeks, Singapore; 2 weeks, US; 2 weeks, US; 3 weeks, Japan. Multi-Tenant: Single Code for Everyone - Improving the Platform Faster (e.g. SFDC, Heroku)
  13. 13. Our Customers – 80 companies http://docs.treasure-data.com/categories/success-stories
  14. 14. A case: “14 Days” from Signup to Success 1. Europe’s largest mobile ad exchange. 2. Serving >20 billion impressions/month for >15,000 mobile apps (Q1 2013). 3. Immediate need for analytics infrastructure: ASAP! 4. With TD, MobFox got into production in only 14 days, with one engineer. “Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it.” Julian Zehetmayr, CEO & Founder. td-agent = fluentd rpm/deb
  15. 15. A case: “Replace” in-house Hadoop with TD 1. Global “Hulu”: online video service with millions of users. 2. Video content is distributed in over 150 languages. 3. Had a hard time maintaining a Hadoop cluster. 4. With TD, Viki deprecated their in-house Hadoop cluster and uses its engineers for core business. [Figure: Before / After] “Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service.” Huy Nguyen, Software Engineer
  16. 16. A case: Treasure Data with BI Tool (Tableau) 1. World’s largest Android application market. 2. Serving >3 billion app downloads for >100 million users. 3. Only one engineer managing the data infrastructure. 4. With TD, the data engineer can focus on analyzing data with the existing BI tool. “I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business.” Simon Dong, Principal Architect - Data Engineering
  17. 17. - Vision - Single Analytics Platform for the World http://www.chisite.org/initiatives/WGII
  18. 18. Treasure Data’s Service Architecture
  19. 19. Treasure Data = Collect + Store + Query
  20. 20. Architecture Breakdown Data Collection • Increasing variety of data sources • No single data schema • Lack of streaming data collection method • 60% of Big Data project resources are consumed here Data Store/Analytics • Remaining complexity in both traditional DWH and Hadoop (very slow time to market) • Challenges in scaling data volume and expanding cost. Connectivity • Required to ensure connectivity with existing BI/visualization/apps by JDBC, ODBC and REST. • Output to other services, e.g. S3, RDBMS, etc.
  21. 21. Product Philosophy § Data first, Schema later • “Schema-on-Read” • Both Batch and Query processing § Simple APIs • Easy to use and powerful § Easy integration • Log collection, BI tools, etc.
  22. 22. Our technology stack § td-agent • ETL part of Treasure Data § Plazma • Big data processing infrastructure • Columnar oriented storage • Reliable data handling § Multi-tenant scheduler • Robust distributed queue and scheduler
  23. 23. 1) Data Collection § 60% of BI project resources are consumed here § Most ‘underestimated’ and ‘unsexy’ but MOST important § Fluentd: OSS lightweight but robust Log Collector • http://fluentd.org/
  24. 24. [Architecture diagram. Components: Apache, App, App, RDBMS and other data sources; td-agent; Treasure Data columnar data warehouse; Query Processing Cluster (HIVE, PIG); Query API (JDBC, REST); User with td-command and BI apps. A “This!” marker highlights the part covered in this section.]
  25. 25. Fluentd: the missing log collector (fluentd.org)
  26. 26. Data Processing: Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)
  27. 27. Related Products. Store / Process: Cloudera, Hortonworks, Treasure Data. Visualize: Tableau, Excel, R. Collect: ??? (easier & shorter time)
  28. 28. In short § Open source log collector written in Ruby • Easy to use, reliable and performs well • like streaming event processing § Using rubygems ecosystem for plugins. It’s like syslogd, but uses JSON for log messages
  29. 29. Example (apache to mongodb): Fluentd on the Web Server tails the access log, buffers the events, and inserts them (tail → event buffering → insert). Input lines: 127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ... 127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ... 127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ... 127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ... 127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ... ... Resulting event: 2012-12-11 07:26:27 apache.log { "host": "127.0.0.1", "method": "GET", ... }
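
The conversion shown on slide 29, from an access-log line to a (time, tag, record) event, can be sketched in a few lines of Python. This is only an illustration of the idea, not Fluentd's own parser: the regular expression, the `parse_line` name, and the extra `path` field are assumptions made here, and the slide's log lines are truncated, so a real access log carries more fields than this pattern captures.

```python
import re
from datetime import datetime

# Hypothetical pattern covering only the fields visible on the slide
# (host, timestamp, method, path); real access logs have more fields.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def parse_line(line, tag="apache.log"):
    """Turn one access-log line into a (time, tag, record) triple.

    Returns None when the line does not match the pattern.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    time = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S")
    record = {
        "host": m.group("host"),
        "method": m.group("method"),
        "path": m.group("path"),
    }
    return time, tag, record
```

Calling `parse_line` on the first line of the slide yields the timestamp 2012-12-11 07:26:27, the tag `apache.log`, and a JSON-like record, which mirrors the event printed on the slide.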
  30. 30. Before Fluentd [Diagram: Applications on Server1, Server2 and Server3; FluentLog Server.] High Latency! must wait for a day...
  31. 31. After Fluentd [Diagram: Applications on Server1, Server2 and Server3, with five Fluentd instances.] In streaming!
  32. 32. Pluggable architecture. Input: Forward, HTTP, File tail, dstat, ... Buffer: File, Memory. Output: Forward, File, MongoDB, ... Engine. Output: rewrite, ... (components are marked “Pluggable”)
  33. 33. [Diagram: sources are Access logs (Apache), App logs (Frontend, Backend), System logs (syslogd) and Databases; Fluentd does buffer / filter / routing; destinations are Alerting (Nagios), Analysis (MongoDB, MySQL, Hadoop) and Archiving (Amazon S3).]
  34. 34. td-agent § Open sourced distribution package of Fluentd • ETL part of Treasure Data • rpm, deb and homebrew § Including useful components • ruby, jemalloc, fluentd • 3rd party gems: td, mongo, webhdfs, etc... • td plugin is for Treasure Data § http://packages.treasure-data.com/
  35. 35. 2) Data Store / Analytics § Remaining complexity in both DWH and Hadoop § Challenges in scaling data volume and expanding cost § Plazma: Hadoop ecosystem and own projects
  36. 36. [Architecture diagram. Components: Apache, App, App, RDBMS and other data sources; td-agent; Treasure Data columnar data warehouse; Query Processing Cluster (HIVE, PIG); Query API (JDBC, REST); User with td-command and BI apps. A “This!” marker highlights the part covered in this section.]
  37. 37. AWS Component Dependencies (1) § RDS • Store user information, job status, etc... • Store metadata of our columnar database • Queue worker / Scheduler § EC2 • API servers (Ruby on Rails 3) • Hadoop clusters • Job workers • Using Chef to deploy
  38. 38. AWS Component Dependencies (2) § ELB • Load balancing of API servers • Load balancing of td-agents § S3 • Columnar storage built on top of S3 • MessagePack columnar format • Realtime / Archive storage • Our Result feature supports S3 output. No EBS, EMR, SQS or other products!
  39. 39. Treasure Data Service Processing Flow [Diagram: Frontend, Queue, Worker, Hadoop, Hadoop, Fluentd.] Applications push metrics to Fluentd (via local Fluentd). Fluentd sums up data per minute (partial aggregation). Librato Metrics for realtime analysis; Treasure Data for historical analysis.
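
The partial aggregation step on slide 39 can be illustrated with a small Python sketch. It assumes the aggregation window is one minute, and it assumes a simple event shape of (unix seconds, metric name, value); neither the `sum_by_minute` name nor this event layout comes from the slides.

```python
from collections import defaultdict


def sum_by_minute(events):
    """Sum metric values per (minute, metric name).

    events: iterable of (unix_seconds, metric_name, value) tuples
            (a hypothetical event shape used only for this sketch).
    Returns a dict keyed by (minute_start_unix_seconds, metric_name).
    """
    totals = defaultdict(int)
    for ts, name, value in events:
        minute = ts - ts % 60  # truncate the timestamp to the start of its minute
        totals[(minute, name)] += value
    return dict(totals)
```

Aggregating before forwarding means the downstream services receive one data point per metric per minute rather than every raw event, which is the usual motivation for partial aggregation.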
  40. 40. Data Processing Flow
  41. 41. Structure of Columnar Storages: Data import → Realtime Storage → merge (every 1 hour) → Archive Storage; queries: SELECT ... Sample entries shown: 2013-07-12 00:23:00 912ec80; 2013-07-13 00:01:00 277a259; 2013-07-14 00:02:00 d52c831; ... and 23c82b0ba3405d4c15aa85d2190e; 6d7b1482412ab14f0332b8aee119; 8a7bc848b2791b8fd603c719e54f; 0e3d402b17638477c9a7977e7dab; ...
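
One way to picture the realtime/archive split on slide 41 is the toy model below. It is a sketch under assumptions rather than a description of Plazma: the `TwoTierStore` class and its method names are hypothetical, chunks are plain Python lists instead of objects on S3, and it assumes that a query scans both tiers so that freshly imported rows are visible before the next merge.

```python
class TwoTierStore:
    """Toy model of a realtime tier that is periodically merged into an archive tier."""

    def __init__(self):
        self.realtime = []  # many small, recently imported chunks
        self.archive = []   # fewer, larger merged chunks

    def import_chunk(self, rows):
        """Data import: each upload lands as its own small chunk."""
        self.realtime.append(list(rows))

    def merge(self):
        """Fold all realtime chunks into one archive chunk.

        Meant to be called periodically (every hour on the slide).
        """
        if not self.realtime:
            return
        merged = [row for chunk in self.realtime for row in chunk]
        merged.sort(key=lambda row: row["time"])
        self.archive.append(merged)
        self.realtime = []

    def scan(self):
        """Yield every row from both tiers (assumed query behaviour)."""
        for chunk in self.archive + self.realtime:
            yield from chunk
```

Keeping imports cheap (append a small chunk) and doing the reorganisation in a batch merge is a common trade-off for storage that sits on an object store.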
  42. 42. [Layer diagram: Query Language / Query Execution / Columnar Data / Object Storage]
  43. 43. 1/4: Compile SQL into MapReduce. SQL Statement: SELECT COUNT(DISTINCT ip) FROM tbl; Hive (+TD UDFs): SQL-to-MapReduce
  44. 44. 2/4: MapReduce is executed in parallel SELECT COUNT(DISTINCT ip) FROM tbl;
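
To make the first two steps concrete, here is a minimal single-process sketch of how `SELECT COUNT(DISTINCT ip) FROM tbl` can be phrased as map, shuffle and reduce phases. It is not the plan Hive actually generates; the function names and the in-memory "splits" are assumptions for illustration, and a real cluster would run the map phase on many machines at once.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (ip, 1) for every input row; each split can be mapped independently."""
    for row in rows:
        yield row["ip"], 1


def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Each key appears once after grouping, so the distinct count is the key count."""
    return len(groups)


def count_distinct_ip(splits):
    """Run the three phases over a list of input splits (lists of row dicts)."""
    pairs = (pair for split in splits for pair in map_phase(split))
    return reduce_phase(shuffle(pairs))
```

For example, two splits containing the addresses a, b and a, c give a distinct count of 3.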
  45. 45. 3/4: Columnar Data Access: Read ONLY the Required Part of Data SELECT COUNT(DISTINCT ip) FROM tbl;
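
The benefit of columnar access for this query can be shown with a small sketch. It assumes an in-memory layout of one Python list per column, which stands in for the MessagePack columnar files mentioned earlier in the deck; `to_columns` and `count_distinct` are hypothetical helper names.

```python
def to_columns(rows):
    """Pivot row dicts into one list per column, padding missing values with None."""
    columns = {}
    for i, row in enumerate(rows):
        for name, value in row.items():
            # A column first seen at row i is back-filled with i None values.
            columns.setdefault(name, [None] * i).append(value)
        for col in columns.values():
            if len(col) <= i:  # column absent from this row
                col.append(None)
    return columns


def count_distinct(columns, name):
    """Touch only the requested column; all other columns are never read."""
    return len({v for v in columns.get(name, []) if v is not None})
```

With this layout, `count_distinct(columns, "ip")` reads a single list no matter how many other columns the table has, which is the point the slide makes about reading only the required part of the data.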
  46. 46. 4/4: Object-based Storage
  47. 47. Apply Schema. Raw data (JSON): {"user":54, "name":"test", "value":"120", "host":"local"} Schema: user:int, name:string, value:int, host:int. SELECT result: user = 54 (int), name = "test" (string), value = 120 (int), host = NULL
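
The schema-on-read behaviour of slide 47 can be reproduced with a short Python sketch. The `apply_schema` function and the `CASTS` table are assumptions made for this example; the rule it follows is the one the slide shows, where a value that cannot be cast to the declared type becomes NULL (None here).

```python
import json

# Hypothetical mapping from schema type names to Python casts.
CASTS = {"int": int, "string": str}


def apply_schema(raw_json, schema):
    """Cast a raw JSON record to the declared column types at read time.

    Missing columns, unknown type names and failed casts all yield None,
    so bad data never blocks the import itself.
    """
    record = json.loads(raw_json)
    out = {}
    for column, type_name in schema.items():
        try:
            out[column] = CASTS[type_name](record[column])
        except (KeyError, ValueError, TypeError):
            out[column] = None
    return out
```

Applied to the slide's record with the schema user:int, name:string, value:int, host:int, the string "120" is cast to the integer 120 and the non-numeric host "local" becomes None, matching the slide.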
  48. 48. Multi-Tenancy § All customers share the Hadoop clusters (Multi Data Centers) § Resource Sharing (Burst Cores), Rapid Improvement, Ease of Upgrade [Diagram: Job Submission + Plan Change → Global Scheduler (On-Demand Resource Allocation) → datacenter A, B, C and D, each with a Local FairScheduler]
  49. 49. Trial and error on Cloud § Rapid development • Change hardware • New architecture testing • Performance testing • Change software • Hadoop parameters • etc... § Use git and chef for these purposes • Easy to deploy and apply changes • git for change history
  50. 50. Our Operation Stack: Full Use of SaaS § Services • CopperEgg • Librato Metrics • Logentries • NewRelic • PagerDuty • Desk.com • Olark • HipChat • Alerting § Tools • Hosted Chef (Opscode) • Jenkins • including integration test
  54. 54. 3) Connectivity § Need to visualize the query result § Use metrics / graph for interactive comparison § Result: Export result and use existing tools
  55. 55. [Architecture diagram. Components: Apache, App, App, RDBMS and other data sources; td-agent; Treasure Data columnar data warehouse; Query Processing Cluster (HIVE, PIG); Query API (JDBC, REST); User with td-command and BI apps. A “This!” marker highlights the part covered in this section.]
  56. 56. Pull and Push approaches. Query (Pull): REST API, JDBC / ODBC Driver, td-command and BI apps through the Query API. Result (Push): Web App, MySQL, S3, … Components: Treasure Data Columnar Storage, Query Processing Cluster, Query API.
  57. 57. Support list § Result • Treasure Data • MySQL • PostgreSQL • Google Spreadsheet • REST API • S3 • etc... http://docs.treasure-data.com/categories/result § BI tool • Pentaho • Tableau • JasperSoft • Indicee • Dr. Sum • Metric Insight • etc... http://docs.treasure-data.com/categories/3rd-party-tools-overview
  58. 58. Conclusion § Treasure Data • Cloud based Big-data analytics platform • Provide Machete for Big data reporting § Big Data processing • Collect / Store / Analytics / Visualization § Consider trade-off • Cloud reinforces the idea but is not a differentiator • What is the strong point? • Should focus on own vision! (Callout: “Our focus!”)
  59. 59. Big Data for the Rest of Us www.treasure-data.com | @TreasureData
