Dealing with Unstructured Data: Scaling to Infinity


John Hammink
Developer Evangelist
Treasure Data Inc.

Great Wide Open 2016
Atlanta, GA
March 16th, 2016

Published in: Technology

  1. 1. Dealing with Unstructured Data: Scaling to Infinity. Image: Boykung/Shutterstock
  2. 2. Image: John Hammink
  3. 3. There are many sources of information
  4. 4. Big Data Simplified: One Approach
        • Sources: app servers, mobile SDKs, web SDK, embedded SDKs and server-side agents, each sending multi-structured events (register, login, start_event, purchase, etc.)
        • Infinite & economical cloud data store: ✓ App log data ✓ Mobile event data ✓ Sensor data ✓ Telemetry
        • Familiar & table-oriented: SQL-based ad-hoc queries, SQL-based dashboards
        • Results push: DBs & data marts, other apps
        Copyright ©2014 Treasure Data. All Rights Reserved.
  5. 5. What is the point of all this data? BI: Business Intelligence Using Very Large Sets of Data
  6. 6. Treasure Data By the Numbers (Jan-2015):
        • 13T+ records of data imported since launch
        • 500K+ records imported each second
        • 1.5 Trillion+ records imported each month
        • 12B records sent per day by one customer
        [Chart: Data Records Stored in the Treasure Data Cloud Service, Aug-12 to Dec-14 (last 2 years), vertical axis 0 to 14 trillion records. Milestone labels: Service Launched, Series A Funding, 100 Customers, Selected by Gartner as Cool Vendor in Big Data, 5 Trillion Records, 10 Trillion Records, Series B Funding, 13 Trillion Records]
        Copyright ©2015 Treasure Data. All Rights Reserved.
  7. 7. Statistics
        • Total Records Stored: 25 Trillion
        • Managed & Supported: 24 * 7 * 365
        • Uptime: 99.99%
        • New Records / second: 1 Million
        • Daily Twitter volume: 100x
  8. 8. A solution?
        • There are trade-offs to consider
        • Any trade-off should make it easy to collect data
        • Easy does it! Un- and semi-structured data (multi-structured data)
        • Open source means it's free; it also means you need someone on hand to implement and maintain it
        • Cloud storage means you don't have to scale and/or shard; the trade-off is a performance hit compared with bare metal
        Image: John Hammink
  9. 9. Image: Dreamstime
  10. 10. Images: Lightspring/Shutterstock, John Hammink, Treasure Data There are a few intro to Data Science blogs at!
  11. 11. What does a pipeline need?
  12. 12. Open vs. Closed source Image: Heather Craig/Shutterstock
  13. 13. Images: PC World, Data-Hive, Wallpapersmela or or ?
  15. 15. # logs from a file
      <source>
        type tail
        path /var/log/httpd.log
        format apache2
        tag web.access
      </source>

      # logs from client libraries
      <source>
        type forward
        port 24224
      </source>

      # store logs to ES and HDFS
      <match *.*>
        type copy
        <store>
          type elasticsearch
          logstash_format
  17. 17. Before fluentd
  18. 18. Multi-structured data • Unstructured data is better suited to data ultimately used in statistics
  19. 19. fluentd!
  20. 20.
  21. 21. Embulk: an open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services
  22. 22. Hivemall Hivemall is a scalable machine learning library that runs on Apache Hive. Hivemall is designed to be scalable to the number of training instances as well as the number of training features. • Classification • Regression • Recommendation • k-nearest neighbor • Anomaly Detection • Feature Engineering
  23. 23. The Hadoop Story on MongoDB. Image courtesy of Steven Francia @ Docker
  24. 24. Questions?
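
The configuration on slide 15 tails a log file, parses each line, tags the resulting event, and copies matching events to more than one store. The Python sketch below imitates that flow in memory so the moving parts are easier to see. It is not fluentd code and not the speaker's implementation: `parse_apache`, `tag_matches`, `Router`, and the two list-backed stores are hypothetical names invented for illustration, and the regular expression is a simplified assumption covering only the Apache common log format. The matcher handles only literal tag parts and `*` (one tag part), and the router sends an event to the first rule that matches.

```python
import re

# Simplified stand-in for an Apache access-log parser
# (assumption: common log format only, not a full "apache2" parser).
APACHE_LINE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\S+)'
)


def parse_apache(line):
    """Turn one raw log line into a structured event dict, or None if it does not match."""
    m = APACHE_LINE.match(line)
    if m is None:
        return None
    event = m.groupdict()
    event["code"] = int(event["code"])
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event


def tag_matches(pattern, tag):
    """Match a dotted tag against a pattern where '*' stands for exactly one tag part."""
    p_parts = pattern.split(".")
    t_parts = tag.split(".")
    if len(p_parts) != len(t_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(p_parts, t_parts))


class Router:
    """Hypothetical router: each rule pairs a tag pattern with a list of stores."""

    def __init__(self):
        self.rules = []

    def add_match(self, pattern, stores):
        self.rules.append((pattern, stores))

    def emit(self, tag, event):
        """Copy the event to every store of the first rule whose pattern matches."""
        for pattern, stores in self.rules:
            if tag_matches(pattern, tag):
                for store in stores:
                    store(tag, event)
                return True
        return False


# Two in-memory lists stand in for the search index and the file store.
search_index = []
file_store = []

router = Router()
router.add_match("*.*", [
    lambda tag, ev: search_index.append((tag, ev)),
    lambda tag, ev: file_store.append((tag, ev)),
])

line = '127.0.0.1 - - [16/Mar/2016:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 2326'
event = parse_apache(line)
routed = router.emit("web.access", event)
print(routed, event["path"], event["code"], len(search_index), len(file_store))
```

Keeping parsing, tagging, and routing as separate steps mirrors the config's split between `<source>` and `<match>` sections: a new destination can be added to the store list without touching how events are collected.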