Information Processing Society of Japan (情報処理学会) Exciting Coding! Treasure Data

  • Time to Value
      –  Setup time and load time for data collection (td-agent) – 1 week
      –  Analysis capabilities out of the box
      –  Simple integration with existing ecosystem (DI & BI)
    Cloud flexibility and economies
      –  Scalable (cloud), extensible (elastic), flexible (schemaless)
      –  Lower TCO compared to on-premise, hosted, or homegrown
      –  On-demand ability to scale, adjust, meet future business requirements
    Simple and supported
      –  “Full” solutions from collection to visualization
      –  Great customer service, support, setup, and SLAs
      –  Easy to extend on your own / self-service – DIY big data

Transcript

  • 1. Treasure Data: Exciting Coding! Nov 2013
      Presented by Masahiro Nakagawa, Senior Software Engineer
      www.treasuredata.com
  • 2. Who are you
      •  Masahiro Nakagawa
          –  @repeatedly
          –  masa@treasure-data.com or d@
      •  Treasure Data, Inc
          –  Senior Software Engineer
              •  Fluentd / Client libraries / etc...
          –  Since 2012/11
      •  Open Source projects
          –  D Programming Language
          –  MessagePack: D, Python, etc…
          –  Fluentd: Core, Mongo, Logger, etc…
          –  Etc…
  • 3. Service Introduction
      Company & Board Meeting Presentation, August 15th, 2013 - 3:30PM PDT
      Presented by:
          –  Hironobu Yoshikawa – CEO
          –  Kazuki Ohta – CTO
          –  Rich Ghiossi – VP, Marketing
          –  Keith Goldstein – VP, Sales
          –  Kengo Hirouchi – Director, Japan
          –  Ankush Rustagi – Director, Marketing
      www.treasuredata.com
  • 4. Company Background
      •  Founded 2011 in Mountain View, CA
          –  The first cloud service for the entire data pipeline
          –  Including: Acquisition, Storage, & Analysis
      •  Provide a “Cloud Data Service”
          –  Fast Time to Value
          –  Cloud Flexibility and Economics
          –  Simple and Well Supported
      •  Treasure Data has 100+ customers in production
          –  Incl. Fortune 500 companies
          –  500+ Billion new records / month
          –  Around 2 Trillion records under management
          –  Variety of use cases and verticals
      The Treasure Data Team
          –  Hiro Yoshikawa – CEO: Open source business veteran
          –  Kaz Ohta – CTO: Founder of world’s largest Hadoop Group
          –  Jeff Yuan – Director, Engineering: LinkedIn, MIT / Michael Stonebraker Lab
          –  Keith Goldstein – VP Sales & Bus Dev: VP of Bus Dev from Tibco and Talend
          –  Rich Ghiossi – VP Marketing: VP of Marketing from ParAccel
      Notable Investors
          –  Othman Laraki: Ex-VP of Growth at Twitter
          –  Jerry Yang: Founder of Yahoo!
          –  Yukihiro “Matz” Matsumoto: Creator of “Ruby” programming language
          –  James Lindenbaum: Founder of Heroku
  • 5. Problem Statement
      •  Lots of companies today produce Big Data by having “New Data Sources” (Sensor, Weblog, etc)
          –  But few have the resources to build a Big Data Analytics system
      •  60-70% of a company’s Big Data time & budget consumed by:
          –  Infrastructure setup & Maintenance
          –  Building Collection & Storage Flows
          –  Hiring/Training Hadoop Expertise
      •  On average, it takes 6 months to get a Hadoop environment into production
  • 6. 6
  • 7. Treasure Data’s Focus (80% of the needs)
  • 8. 8
  • 9. Treasure Data Service: Overview
      Acquire
          –  Treasure Agent: streaming log collector (JSON) for web logs, app logs, and sensor data
          –  Bulk Import: parallel upload from CSV, MySQL, etc. for RDBMS, CRM, and ERP data
      Store
          –  Treasure Data Cloud: flexible, scalable, columnar storage
      Analyze
          –  BI Connectivity (REST API, SQL, Pig, JDBC / ODBC) to BI tools: Tableau, Metric Insights, QlikView, Excel, etc.
          –  Result Push (REST API, SQL, Pig) to dashboards, custom apps, local DB, FTP server, etc.
      Time to Value / Economy & Flexibility / Simple & Supported
  • 10. Our Value Propositions
      •  Faster time to value: on-demand cloud infrastructure & versatile streaming data collection agent
          –  Instantly provision a fully tuned & managed infrastructure
          –  Go live into production on average in 14 days (collection, analytics, & BI)
      •  Cloud flexibility and economics: a fraction of the cost of traditional solutions by leveraging cloud storage and processing, which scales to meet your needs
          –  Leverage the cost-advantage of the cloud
          –  Leverage the elasticity of the cloud – scale on demand
          –  Predictable monthly subscription fee
          –  No upfront costs & no long-term commitment
      •  Simple and well supported: we are passionate about simplicity and customer support excellence
          –  Focus your time on analyzing your data
          –  Rely on us to keep your data secure & online
          –  We love making customers successful & building long-term relationships
  • 11. Initial Setup & Onboarding – Two Weeks
      1. Data Collection
          •  Setup, tuning, and monitoring of Treasure Agent
          •  Embed Treasure Agent code into applications
      2. Data Storage
          •  Basic log templates (register, pay, login, etc.)
          •  Basic KPI queries (DAU, MAU, ARPU, etc.)
      3. Data Analysis
          •  Setup dashboards with basic KPIs
          •  Training on creating customized reports and ad hoc querying
      4. Service & Support
          •  Assigned a dedicated technical account manager
          •  Real-time support via email, online chat, and call
  • 12. Solutions Accelerators
      Out-of-the-Box Reporting / Treasure Data Platform / Configured Treasure Agent
      Solution Components:
          -  Treasure Data Platform
          -  Event Collection Template
          -  Pre-configured Treasure Agent Configuration
          -  BI Dashboard with KPIs
  • 13. - Vision - gle Analytics Platform for the Wo 13
  • 14. Treasure Data Platform: Architecture Overview
  • 15. Data Acquisition – Streaming Capture
      Application Server:
          # Application Code
          ...
          # Post event to Treasure Data
          TD.event.post('access', {:uid=>123})
          ...
      Treasure Data Library: Java, Ruby, PHP, Perl, Python, Scala, Node.js
      Treasure Agent (local), open-sourced as the Fluentd project ( http://fluentd.org/ )
          •  Automatic Microbatching
          •  Local buffering Fallback
          •  Network Tolerance
      Treasure Data Cloud
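The micro-batching and local-buffering behaviour listed on slide 15 can be illustrated with a small, self-contained sketch. `MicroBatcher`, its `batch_size:` option, and the sender block are hypothetical names chosen for this example; they are not the Treasure Data library or Treasure Agent API, and the real agent's buffering is more involved.

```ruby
# Hypothetical sketch of micro-batching with a local fallback buffer.
# MicroBatcher and its sender block are illustrative names only.
class MicroBatcher
  attr_reader :pending

  def initialize(batch_size:, &sender)
    @batch_size = batch_size
    @sender = sender   # callable that uploads one batch; may raise
    @pending = []      # local buffer of events not yet delivered
  end

  # Queue one event; flush automatically once a full batch is buffered.
  def post(table, record)
    @pending << { table: table, time: Time.now.to_i, record: record }
    flush if @pending.size >= @batch_size
  end

  # Try to send everything buffered. On failure the events stay in the
  # local buffer so that a later flush can retry them.
  def flush
    return true if @pending.empty?
    batch = @pending.dup
    @sender.call(batch)
    @pending.shift(batch.size)
    true
  rescue StandardError
    false
  end
end

sent = []
network_up = false
batcher = MicroBatcher.new(batch_size: 2) do |batch|
  raise IOError, 'network down' unless network_up
  sent << batch
end

batcher.post('access', { uid: 123 })
batcher.post('access', { uid: 456 })  # batch is full, upload fails, events stay buffered
network_up = true
batcher.flush                         # the retry delivers both events in one batch
puts "delivered #{sent.flatten.size} events, #{batcher.pending.size} still buffered"
```

The application only ever calls `post`; batching and retry stay inside the collector, which is the division of labour the slide attributes to the local agent.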
  • 16. Data Acquisition – Bulk Loader
      Sources: RDBMS, App, SaaS, FTP
      Formats: CSV, TSV, JSON, MessagePack, Apache, regex, MySQL, FTP
      Bulk Loader: Prepare → Upload → Perform → Commit
      Treasure Data Cloud
  • 17. Data Storage (Treasure Data Cloud)
      Default (schema-less):
          time        v
          1384160400  {"ip":"135.52.211.23", "code":"0"}
          1384162200  {"ip":"45.25.38.156", "code":"-1"}
          1384164000  {"ip":"97.12.76.55", "code":"99"}
          SELECT v['ip'] as ip, v['code'] as code …
      Schema applied (~30% faster, for higher query performance):
          time        ip : string     code : int
          1384160400  135.52.211.23   0
          1384162200  45.25.38.156    -1
          1384164000  97.12.76.55     99
          SELECT ip, code …
      •  Stored “schema-less” as JSON
          –  Schema can be applied/updated AFTER storage
      •  Compressed & columnar format
      •  Quickly scale-up processing power
          –  WITHOUT reloading/redistributing the data
      •  Optimized for time-based filtering
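To make "schema can be applied after storage" concrete, here is a minimal sketch that reads the slide's three rows as a time column plus a JSON value and projects typed columns out of them at query time, with a time-range filter. `ROWS`, `SCHEMA`, and `apply_schema` are names invented for this illustration; the sketch says nothing about how the Treasure Data storage engine is implemented.

```ruby
require 'json'

# Rows as stored "schema-less": a time column plus a JSON value column `v`.
# The values mirror slide 17; all names here are illustrative.
ROWS = [
  { time: 1384160400, v: '{"ip":"135.52.211.23","code":"0"}' },
  { time: 1384162200, v: '{"ip":"45.25.38.156","code":"-1"}' },
  { time: 1384164000, v: '{"ip":"97.12.76.55","code":"99"}' }
].freeze

# A schema declared after the data was stored: column name => converter.
SCHEMA = { 'ip' => :to_s, 'code' => :to_i }.freeze

# Project typed columns out of the JSON value, keeping only rows whose
# time falls inside [from, to) (time-based filtering).
def apply_schema(rows, schema, from:, to:)
  rows.select { |row| (from...to).cover?(row[:time]) }.map do |row|
    v = JSON.parse(row[:v])
    typed = schema.to_h { |col, conv| [col.to_sym, v[col].public_send(conv)] }
    { time: row[:time] }.merge(typed)
  end
end

apply_schema(ROWS, SCHEMA, from: 1384160400, to: 1384164000).each { |r| puts r.inspect }
```

Changing `SCHEMA` changes the column types seen by the query without touching `ROWS`, which is the property the slide highlights.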
  • 18. Data Analysis (Treasure Data Cloud)
      REST API
      Heavy Lifting SQL (Hive):
          -  Hive’s Built-in UDFs
          -  TD Added Functions:
              -  Time Functions
              -  First, Last, Rank
              -  Sessionize
      Scripted Processing (Pig):
          -  DataFu (LinkedIn)
          -  Piggybank (Apache)
      Interactive SQL:
          -  Treasure Query Accelerator (Impala)
      Scheduled Jobs:
          -  SQL, Pig Scripts
          -  Data Pushes
      JDBC Connectivity:
          -  Custom Java Apps
          -  Standards-based
          -  BI Tool Integration
      Tableau ODBC connector:
          -  Leverages Impala
      Push Query Results:
          -  MySQL, PostgreSQL
          -  Google Spreadsheet
          -  Web, FTP, S3
          -  Leftronic, Indicee
          -  Treasure Data Table
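Slide 18 lists "Sessionize" among the added functions without describing it. As a rough illustration of the general technique only, the sketch below groups one user's event timestamps into sessions separated by an inactivity timeout. The function name, arguments, and return shape are assumptions for this example and do not reflect the signature or behaviour of Treasure Data's Hive function.

```ruby
# Assign a session number to each event timestamp of one user: a new
# session starts when the gap since the previous event exceeds `timeout`
# seconds. Illustrative only; not Treasure Data's own Sessionize function.
def sessionize(times, timeout)
  session = 0
  previous = nil
  times.sort.map do |t|
    session += 1 if previous && t - previous > timeout
    previous = t
    [t, session]
  end
end

# Four events with a 30-minute timeout.
sessionize([0, 100, 4000, 4100], 1800).each do |time, session|
  puts "t=#{time} session=#{session}"
end
```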
  • 19. Treasure Data: General Use Cases
  • 20. A case: “14 Days” from Signup to Success
      1.  Europe’s largest mobile ad exchange.
      2.  Serving >60 billion imps / month for >30,000 mobile apps (Q4 2013)
      3.  Immediate need of analytics infrastructure: ASAP!
      4.  With TD, MobFox got into production in only 14 days, with one engineer.
      "Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it."
      Julian Zehetmayr, CEO & Founder
  • 21. A case: “Replace” in-house Hadoop with TD
      1.  Global “Hulu”: online video service with millions of users
      2.  Video content is distributed in over 150 languages.
      Before
      3.  Had a hard time maintaining its Hadoop cluster
      After
      4.  With TD, Viki deprecated its in-house Hadoop cluster and uses its engineers for the core business.
      “Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service."
      Huy Nguyen, Software Engineer
  • 22. A case: Treasure Data with BI Tool (Tableau)
      1.  World’s largest Android application market
      2.  Serving >3 billion app downloads for >100 million users
      3.  Only one engineer managing the data infrastructure
      4.  With TD, the data engineer can focus on analyzing data with the existing BI tool
      "I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business."
      Simon Dong, Principal Architect - Data Engineering
  • 23. Treasure Data Platform: Fluentd Overview
  • 24. What is Fluentd?
      •  Open source log collector written in Ruby
          –  Easy to use, reliable, and performant
          –  Streaming event processing
      •  Uses the rubygems ecosystem to distribute plugins
      Fluentd: the missing log collector (fluentd.org)
  • 25. Data processing pipeline
      Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)
  • 26. Data processing pipeline
      Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)
      Collect: important, but no de facto middleware!
  • 27. Fluentd general example
      Web Server → (tail) → Fluentd (event buffering) → (insert)
      Input, apache.log read with tail:
          127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...
          127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...
          127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...
          127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...
          127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
      Event (time, tag, record):
          2012-02-04 01:33:51 apache.log { "host": "127.0.0.1", "method": "GET", ... }
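The step from a tailed log line to a structured event on slide 27 can be sketched in plain Ruby. This is not Fluentd's tail input or its Apache parser: `LINE` and `parse_line` are hypothetical helpers, the pattern covers only the fields visible on the slide, and treating the timestamp as UTC is an assumption because the slide shows no time zone.

```ruby
require 'time'

# Simplified pattern for lines shaped like the slide's example:
#   127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / HTTP/1.1"
# Real access logs carry more fields (time zone, status, size, ...).
LINE = %r{\A(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)}

# Turn one log line into a [tag, time, record] triple, or nil when the
# line does not match. The timestamp is assumed to be UTC.
def parse_line(tag, line)
  m = LINE.match(line)
  return nil unless m
  time = Time.strptime("#{m[:time]} +0000", '%d/%b/%Y:%H:%M:%S %z').to_i
  [tag, time, { 'host' => m[:host], 'method' => m[:method], 'path' => m[:path] }]
end

p parse_line('apache.log', '127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / HTTP/1.1"')
```

Each event carries a tag for routing, a Unix time, and a record hash, which matches the three parts shown in the slide's event example.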
  • 28. Pluggable Architecture
      Input: Forward, HTTP, File tail, dstat, ...
      Engine
      Buffer: File, Memory
      Output: Forward, File, MongoDB, ...
      Output: rewrite, ...
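A toy model may help show how the input, engine, buffer, and output pieces on slide 28 relate. `MemoryBuffer`, `Engine`, `add_output`, and `emit` below are names made up for this sketch; they imitate the shape of the architecture and are not Fluentd's actual plugin classes or APIs.

```ruby
# Toy model of the pipeline on the slide: inputs emit tagged events, the
# engine routes them by tag, a buffer collects them, and an output receives
# each full chunk. All names are hypothetical, not Fluentd's own.
class MemoryBuffer
  def initialize(limit)
    @limit = limit
    @chunk = []
  end

  # Store an event; return a full chunk once the limit is reached, else nil.
  def push(event)
    @chunk << event
    return nil if @chunk.size < @limit
    take
  end

  # Hand over the current chunk and start a new one.
  def take
    chunk = @chunk
    @chunk = []
    chunk
  end
end

class Engine
  def initialize
    @routes = []   # [tag_prefix, buffer, output] triples
  end

  # Register an output (a block) behind a buffer for tags with this prefix.
  def add_output(tag_prefix, buffer, &output)
    @routes << [tag_prefix, buffer, output]
  end

  # Called by an input for every event it produces.
  def emit(tag, time, record)
    @routes.each do |prefix, buffer, output|
      next unless tag.start_with?(prefix)
      chunk = buffer.push([tag, time, record])
      output.call(chunk) if chunk
    end
  end
end

engine = Engine.new
engine.add_output('apache.', MemoryBuffer.new(2)) { |chunk| puts "write #{chunk.size} events" }
engine.emit('apache.access', 1, { 'method' => 'GET' })
engine.emit('syslog.auth', 2, { 'msg' => 'not routed' })
engine.emit('apache.access', 3, { 'method' => 'GET' })
```

Swapping the buffer or the output block leaves the engine untouched, which is the point of keeping each stage pluggable.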
  • 29. Meet your requirements by writing plugins
      Inputs:
          –  Access logs: Apache
          –  App logs: Frontend, Backend
          –  System logs: syslogd
          –  Databases
      Fluentd: filter / buffer / routing
      Outputs:
          –  Alerting: Nagios
          –  Analysis: MongoDB, MySQL, Hadoop
          –  Archiving: Amazon S3
  • 30. Treasure Agent (td-agent)
      •  Open source distribution package of Fluentd
          –  ETL part of Treasure Data
          –  deb / rpm / homebrew
      •  Includes useful components
          –  Ruby, jemalloc, fluentd
          –  3rd party gems: td, mongo, webhdfs, etc…
          –  Init script
      •  http://packages.treasuredata.com/
  • 31. Fluentd users
  • 32. Treasure Data Platform: Backend Overview
  • 33. AWS components
      •  RDS
          –  Store user information, job, status, etc…
          –  Queue Worker / Scheduler
      •  EC2
          –  API Server, Hadoop Cluster, Job Worker / Scheduler
      •  S3
          –  Columnar storage
              •  Realtime / Archive storage
              •  MessagePack columnar
      •  ELB
  • 34. Plazma (Hadoop, Storage, Queue and Workers)
      Components: Frontend, Worker, Queue, Hadoop
      •  Applications push metrics to Fluentd (via local Fluentd)
      •  Fluentd sums up data per minute (partial aggregation)
      •  Treasure Data for historical analysis
      •  Librato Metrics for realtime analysis
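The partial aggregation mentioned on slide 34 amounts to summing metric points into one-minute buckets before forwarding them. The sketch below shows that arithmetic only; the input shape of `[name, unix_time, value]` and the function name `aggregate_by_minute` are assumptions for illustration, not the deck's actual Fluentd configuration.

```ruby
# Sum metric values into one-minute buckets so that downstream services
# receive one point per metric per minute. The [name, unix_time, value]
# input shape is an assumption made for this illustration.
def aggregate_by_minute(points)
  sums = Hash.new(0)
  points.each do |name, time, value|
    minute = time - (time % 60)   # truncate to the start of the minute
    sums[[name, minute]] += value
  end
  sums
end

points = [
  ['jobs.finished', 120, 1],
  ['jobs.finished', 150, 2],
  ['jobs.finished', 180, 5]
]
aggregate_by_minute(points).each do |(name, minute), total|
  puts "#{name} minute=#{minute} total=#{total}"
end
```

Because sums can be added again later, the per-minute totals can still be rolled up into hourly or daily figures for historical analysis.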
  • 35. Treasure Data: Development Philosophy
  • 36. Open-Source Culture
      •  TD prefers engineers who contribute to OSS products
          –  MessagePack, Fluentd, ZeroMQ, Hadoop, MongoDB, Angular.js, Huahin, D-Lang, etc.
          –  https://github.com/treasure-data?tab=members
      •  Reasons
          –  Fixing & improving other people’s code is crucial for a distributed team.
          –  TD’s engineering workflow is very similar to an OSS product workflow.
          –  A+ OSS engineers will bring another A+ OSS engineer!
  • 37. OSS vs. Proprietary
      •  OSS Everything on the Client Side
          –  http://github.com/treasure-data/
          –  http://fluentd.org/
              •  TD is helping the world to collect more data in an analytics-ready format
              •  2000+ companies (e.g. Nintendo, SlideShare/LinkedIn) use it as an OSS product. 3-4% of the users are TD’s customers.
              •  We also leverage other OSS products as much as possible.
      •  Closed Source on the Cloud Side
          –  The core value must be proprietary to sustain the business.
          –  The components can be OSS, but most of the system will remain proprietary to create the value chain.
  • 38. How to decide the Product Roadmap?
      •  Solving the Customer Pain is the #1 Priority
          –  Developers support customers directly and spend 30%-40% of their development time talking with customers
          –  Developers are the BEST people to come up with the solution.
          –  # of code lines != value
      •  Suffering Oriented Development
          –  First, make it possible
          –  Then, make it beautiful
          –  Then, make it fast
      •  The Largest Customer Pain is NOT always applicable to other customers.
          –  Need to be brave to say NO. NO. NO. NO. NO….
      •  TD doesn’t have a 1-year Product Roadmap. A 3-month roadmap accelerates development, and other teams (marketing / sales) too.
  • 39. Distributed Team (International)
      •  13 Engineers as of Nov. 2013
          –  5 Engineers in Tokyo, Japan
          –  8 Engineers in Mountain View, USA
          –  40% of the whole company
      •  Asynchronous Communication
          –  Use async communication tools as much as possible: Chat, JIRA, Email, Github, etc.
          –  Use video conferencing for weekly sync-up
      •  English is the primary communication language
          –  If you cannot speak English, your value is nearly zero on the Treasure Data engineering team.
  • 40. Distributed Team (Deployment)
      •  Predictable Deployment Cycle
          –  Weekly Deployment
              •  Continuous deployment didn’t fit a B2B SaaS application; our customers want changes to be predictable.
              •  As a distributed team, it’s hard to track every change + deployment status.
          –  Track every change in JIRA; the QA engineer is also responsible for deployment.
      •  Continuous Deployment for Staging
          –  Single branch, always automatically deployed to the staging environment
          –  Monitoring is continuous testing
      •  On-Call Alert Schedule, based on the Timezone
          –  No need to get up around 3am
  • 41. Leverage Cloud Services
      •  Use Cloud Services as Much as Possible
          –  Don’t hire people, use cloud services.
          –  Outsource everything except your core value.
          –  Developers tend to forget their own cost. An hour of your time already costs the company around $50.
      •  Examples
          –  EC2 (IaaS)
          –  CopperEgg (Infrastructure Monitoring)
          –  NewRelic (Application Performance Management)
          –  Hosted Chef (Configuration Management)
          –  Librato Metrics (Application Metrics)
          –  Pager Duty (Alerting)
          –  Logentries (Log Search)
          –  CircleCI, TravisCI (Continuous Integration)
          –  HipChat, JIRA, Confluence (Development Tool)
          –  Etc….
  • 42. Treasure Data: Conclusion
  • 43. Key points
      •  Treasure Data, Inc
          –  Cloud-based Data Service for the world
          –  Customer-oriented development
      •  Our Unique Products and Culture
          –  Fluentd / Plazma (backend)
          –  OSS enthusiasts
      •  Use Cloud or not?
          –  The cloud gives an idea leverage, but it is not a differentiator
          –  Focus on your own vision!