Data-Driven Development Era
and Its Technologies
Developers Summit 2015 Autumn (Oct 14, 2015)
Satoshi Tagomori (@tagomoris)
Satoshi "Moris" Tagomori
(@tagomoris)
Fluentd, Norikra, Hadoop, ...
Treasure Data, Inc.
HQ
Branch
http://www.treasuredata.com/
Main Topics around "Data"
• Data collection
• Storage
• Data processing
• Batch distributed processing
• Stream processing
• Machine Learning
• Near real-time query & Data lake
• Visualization
Data Analytics Flow
Collect Store Process Visualize
Data source
Reporting
Monitoring
Where before What
Using Services or Not
• Using services fully-managed:
• Google BigQuery & Dataflow
• Treasure Data services
• Using services self-managed:
• Amazon EMR & Redshift
• Google Cloud Dataproc
• Using your own environment & cluster
Using Services or Not
• Using services fully-managed:
• Google BigQuery & Dataflow
• Treasure Data services
• Using services self-managed:
• Amazon EMR & Redshift
• Google Cloud Dataproc
• Using your own environment & cluster
a bit more cost
extremely less efforts
fully controlled by self
extremely more efforts
less cost
less efforts
Using Services or Not:
"Use Services!"
To concentrate
DATA and Analytics,
NOT tools
Why should we use services?
• About distributed systems:
• hard to operate & upgrade
• impossible to "small-start"
• very hard to hire professional engineer
• Data Driven Development:
• collect/store data at first!
• consider output data at second!
• "before building your own environment"
Really? Are you TD guy?
• ...Really!
• But it requires very long discussions :P
• "スタートアップのデータ処理基盤、作るか、使うか"

http://tsuchinoko.dmmlabs.com/?p=1770
How to choose software/services
in
Data-Driven Development
"What" decides "How"
• Distributed systems are to solve problems
• There're many kind of data
• There're many problems
• Systems solve different problems from each other
• There are no "Silver bullet"!
What First, How Second
• What do you want to do?
• Reporting? Analytics? Recommendation? or ...
• What type of data you wan to process?
• Stored large log? Stream sensor data? or ...
• What is you need as result?
• CSV? Spreadsheet? Graph? DB Relation? or ...
How?(just for example)
• MapReduce, Tez
• Large batch jobs, big JOINs, high stability
• Spark
• Small/Middle batch jobs, machine learning
• Impala, Presto, Drill, Redshift, BigQuery
• Near-real-time search, small-to-large analytics
• Storm, Spark streaming
• Stream data conversion/aggregation
"Processing" is just a part
of whole dataflow!
Data Analytics Flow (again)
Collect Store Process Visualize
Data source
Reporting
Monitoring
Data Analytics Flow (again)
Collect Store Process Visualize
Data source
Reporting
Monitoring
Data Collection
• Data Driven Development -> collect at first!
• As batch: Data already exists as files
• Easily integrated with existing batch systems
• Sqoop, Embulk, ...
• As stream: Data just generated now
• Easily connected with monitoring systems
• Without burst network traffic
• Flume, Logstash, Fluentd, ...
Fluentd: Support Service
by SRA OSS
with Treasure Data
Released
TODAY!
Other Important Topics
• Storage: Performance, Availability, Schema management
• Apache Hadoop HDFS, Apache HBase, Amazon S3, Cloudera Kudu, ...
• Visualization: Functionality, Connectivity, Visibility
• Tableau, Pentaho, Many other enterprise products, ...
• Distributed Queues: Performance, Stability, Connectivity
• Apache Kafka, Amazon Kinesis, ...
Get Familiar with Options
NOT to Take Pains about Technology!
Concentrate
DATA and Analytics,
NOT tools.
Thanks!

Data-Driven Development Era and Its Technologies

  • 1.
    Data-Driven Development Era andIts Technologies Developers Summit 2015 Autumn (Oct 14, 2015) Satoshi Tagomori (@tagomoris)
  • 2.
    Satoshi "Moris" Tagomori (@tagomoris) Fluentd,Norikra, Hadoop, ... Treasure Data, Inc.
  • 4.
  • 5.
  • 6.
    Main Topics around"Data" • Data collection • Storage • Data processing • Batch distributed processing • Stream processing • Machine Learning • Near real-time query & Data lake • Visualization
  • 7.
    Data Analytics Flow CollectStore Process Visualize Data source Reporting Monitoring
  • 8.
  • 9.
    Using Services orNot • Using services fully-managed: • Google BigQuery & Dataflow • Treasure Data services • Using services self-managed: • Amazon EMR & Redshift • Google Cloud Dataproc • Using your own environment & cluster
  • 10.
    Using Services orNot • Using services fully-managed: • Google BigQuery & Dataflow • Treasure Data services • Using services self-managed: • Amazon EMR & Redshift • Google Cloud Dataproc • Using your own environment & cluster a bit more cost extremely less efforts fully controlled by self extremely more efforts less cost less efforts
  • 11.
    Using Services orNot: "Use Services!" To concentrate DATA and Analytics, NOT tools
  • 12.
    Why should weuse services? • About distributed systems: • hard to operate & upgrade • impossible to "small-start" • very hard to hire professional engineer • Data Driven Development: • collect/store data at first! • consider output data at second! • "before building your own environment"
  • 13.
    Really? Are youTD guy? • ...Really! • But it requires very long discussions :P • "スタートアップのデータ処理基盤、作るか、使うか"
 http://tsuchinoko.dmmlabs.com/?p=1770
  • 14.
    How to choosesoftware/services in Data-Driven Development
  • 15.
    "What" decides "How" •Distributed systems are to solve problems • There're many kind of data • There're many problems • Systems solve different problems from each other • There are no "Silver bullet"!
  • 16.
    What First, HowSecond • What do you want to do? • Reporting? Analytics? Recommendation? or ... • What type of data you wan to process? • Stored large log? Stream sensor data? or ... • What is you need as result? • CSV? Spreadsheet? Graph? DB Relation? or ...
  • 17.
    How?(just for example) •MapReduce, Tez • Large batch jobs, big JOINs, high stability • Spark • Small/Middle batch jobs, machine learning • Impala, Presto, Drill, Redshift, BigQuery • Near-real-time search, small-to-large analytics • Storm, Spark streaming • Stream data conversion/aggregation
  • 18.
    "Processing" is justa part of whole dataflow!
  • 19.
    Data Analytics Flow(again) Collect Store Process Visualize Data source Reporting Monitoring
  • 20.
    Data Analytics Flow(again) Collect Store Process Visualize Data source Reporting Monitoring
  • 21.
    Data Collection • DataDriven Development -> collect at first! • As batch: Data already exists as files • Easily integrated with existing batch systems • Sqoop, Embulk, ... • As stream: Data just generated now • Easily connected with monitoring systems • Without burst network traffic • Flume, Logstash, Fluentd, ...
  • 22.
    Fluentd: Support Service bySRA OSS with Treasure Data Released TODAY!
  • 23.
    Other Important Topics •Storage: Performance, Availability, Schema management • Apache Hadoop HDFS, Apache HBase, Amazon S3, Cloudera Kudu, ... • Visualization: Functionality, Connectivity, Visibility • Tableau, Pentaho, Many other enterprise products, ... • Distributed Queues: Performance, Stability, Connectivity • Apache Kafka, Amazon Kinesis, ...
  • 24.
    Get Familiar withOptions NOT to Take Pains about Technology!
  • 25.