This document provides an overview of Treasure Data's big data analytics platform. It discusses how Treasure Data ingests and processes large amounts of schema-less data from various sources in real-time and at scale. It also describes how Treasure Data stores and indexes the data for fast querying using SQL interfaces while maintaining schema flexibility.
5. Treasure Data Service
> A simplified cloud analytics infrastructure
> Customers focus on their business
> SQL interfaces for Schema-less data sources
> Fit for Data Hub / Lake
> Batch / Low latency / Machine Learning
> Lots of ingestion and integrated solutions
> Fluentd / Embulk / Data Connector / SDKs
> Result Output / Prestogres Gateway / BI tools
> Awesome support for time to value
8. Plazma by the numbers
> Streaming import
> 45 billion records / day
> Bulk Import
> 10 billion records / day
> Hive Query
> 3+ trillion records / day
> Machine Learning queries (Hivemall) have increased
> Presto Query
> 3+ trillion records / day
9. TD’s resource management
> Guarantee and boost compute resources
> Guarantee for stabilizing query performance
> Boost for sharing free resources
> Get the benefits of multi-tenancy
> Global resource scheduler
> Manage jobs, resources and priorities across users
> Separate storage from compute resource
> Easy to scale workers
> We can use S3 / GCS / Azure Storage for reliable backend
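The "guarantee and boost" idea can be illustrated with a toy slot allocator. This is only a sketch under assumptions: the function name `allocate_slots`, the `(guaranteed, demanded)` tuple layout and the round-robin sharing of free slots are all hypothetical and are not taken from the slides.

```python
from typing import Dict, Tuple


def allocate_slots(capacity: int, tenants: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Toy allocator. `tenants` maps a name to (guaranteed_slots, demanded_slots)."""
    # Guarantee phase: each tenant gets what it asks for, up to its guarantee.
    grants = {name: min(guarantee, demand) for name, (guarantee, demand) in tenants.items()}
    free = capacity - sum(grants.values())
    if free < 0:
        raise ValueError("guaranteed demand exceeds cluster capacity")
    # Boost phase: share the free slots, one at a time, among tenants that want more.
    while free > 0:
        hungry = [name for name, (_, demand) in tenants.items() if grants[name] < demand]
        if not hungry:
            break
        for name in hungry:
            if free == 0:
                break
            grants[name] += 1
            free -= 1
    return grants


# Tenant "b" uses less than its guarantee, so "a" is boosted with the idle slots.
example = allocate_slots(10, {"a": (4, 8), "b": (4, 1), "c": (2, 2)})
```

Here the guarantee keeps each tenant's baseline stable, while the boost phase lets idle capacity flow to whoever still has work.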
11. Import Queue
td-agent / fluentd
Import Worker
✓ Buffering for 5 minutes
✓ Retrying (at-least once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk
API Server
It’s like JSON. but fast and small.
unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…}
{"time":1426047912,"uid":9,…}
{"time":1426047939,"uid":3,…}
{"time":1426047951,"uid":2,…}
…
MySQL (PerfectQueue)
12. Import Queue
td-agent / fluentd
Import Worker
✓ Buffering for 1 minute
✓ Retrying (at-least once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk
API Server
It’s like JSON. but fast and small.
MySQL (PerfectQueue)
unique_id time
375828ce5510cadb 2015-12-01 10:47
2024cffb9510cadc 2015-12-01 11:09
1b8d6a600510cadd 2015-12-01 11:21
1f06c0aa510caddb 2015-12-01 11:38
13. Import Queue
td-agent / fluentd
Import Worker
✓ Buffering for 5 minutes
✓ Retrying (at-least once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk
API Server
It’s like JSON. but fast and small.
MySQL (PerfectQueue)
unique_id time
375828ce5510cadb 2015-12-01 10:47
2024cffb9510cadc 2015-12-01 11:09
1b8d6a600510cadd 2015-12-01 11:21
1f06c0aa510caddb 2015-12-01 11:38
UNIQUE (at-most once)
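Combining client retries (at-least once) with a uniqueness constraint on the chunk ID (at-most once) can be sketched as follows. This is a minimal illustration, not the actual queue: it uses an in-memory SQLite database as a stand-in for the MySQL-backed queue, and the table name `import_queue` and the helper `enqueue` are hypothetical.

```python
import sqlite3


def open_queue() -> sqlite3.Connection:
    """Create an in-memory queue table (SQLite stands in for MySQL here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE import_queue (unique_id TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, unique_id: str, payload: bytes) -> bool:
    """Return True if the chunk was accepted, False if it was already queued."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO import_queue (unique_id, payload) VALUES (?, ?)",
                (unique_id, payload),
            )
        return True
    except sqlite3.IntegrityError:
        # The uniqueness constraint rejects a chunk that was retried by the client.
        return False


queue = open_queue()
first = enqueue(queue, "375828ce5510cadb", b"chunk")   # original upload
retry = enqueue(queue, "375828ce5510cadb", b"chunk")   # retried upload, same ID
queued = queue.execute("SELECT COUNT(*) FROM import_queue").fetchone()[0]
```

The client is free to retry as often as it likes; the duplicate is dropped at insert time, so each chunk is queued once.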
16. Realtime Storage
PostgreSQL
Amazon S3 / Basho Riak CS
Metadata
Import Queue
Import Worker
Import Worker
Import Worker
uploaded time     file  index range                                 records
2015-03-08 10:47        [2015-12-01 10:47:11, 2015-12-01 10:48:13]  3
2015-03-08 11:09        [2015-12-01 11:09:32, 2015-12-01 11:10:35]  25
2015-03-08 11:38        [2015-12-01 11:38:43, 2015-12-01 11:40:49]  14
…                 …     …                                           …
Archive Storage
Metadata of the records in a file (stored on PostgreSQL)
17. Amazon S3 / Basho Riak CS
Metadata
Merge Worker (MapReduce)
uploaded time     file  index range                                 records
2015-03-08 10:47        [2015-12-01 10:47:11, 2015-12-01 10:48:13]  3
2015-03-08 11:09        [2015-12-01 11:09:32, 2015-12-01 11:10:35]  25
2015-03-08 11:38        [2015-12-01 11:38:43, 2015-12-01 11:40:49]  14
…                 …     …                                           …
file  index range                                 records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]  3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]  2,143
…     …                                           …
Realtime Storage
Archive Storage
PostgreSQL
Merge every 1 hour
Retrying + Unique (at-least-once + at-most-once)
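The hourly merge from small Realtime Storage files into larger Archive Storage partitions can be sketched roughly like this. It is an in-memory illustration only: the slide names MapReduce as the merge mechanism, whereas this sketch uses a plain loop, and the function `merge_hourly` and the record layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def merge_hourly(realtime_files):
    """Merge small files (lists of records with a 'time' datetime) into hourly partitions."""
    buckets = defaultdict(list)
    for records in realtime_files:
        for record in records:
            # Truncate the timestamp to the start of its hour to pick the partition.
            hour_start = record["time"].replace(minute=0, second=0, microsecond=0)
            buckets[hour_start].append(record)
    merged = []
    for hour_start in sorted(buckets):
        rows = sorted(buckets[hour_start], key=lambda r: r["time"])
        merged.append({
            "index_range": (hour_start, hour_start + timedelta(hours=1)),
            "records": rows,
        })
    return merged


realtime_files = [
    [{"time": datetime(2015, 12, 1, 10, 47, 11)}, {"time": datetime(2015, 12, 1, 10, 48, 13)}],
    [{"time": datetime(2015, 12, 1, 11, 9, 32)}],
]
archive = merge_hourly(realtime_files)
```

Each resulting partition carries the hour-wide index range that the metadata table records for it.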
18. Amazon S3 / Basho Riak CS
Metadata
uploaded time     file  index range                                 records
2015-03-08 10:47        [2015-12-01 10:47:11, 2015-12-01 10:48:13]  3
2015-03-08 11:09        [2015-12-01 11:09:32, 2015-12-01 11:10:35]  25
2015-03-08 11:38        [2015-12-01 11:38:43, 2015-12-01 11:40:49]  14
…                 …     …                                           …
file  index range                                 records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]  3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]  2,143
…     …                                           …
Realtime Storage
Archive Storage
PostgreSQL
GiST (R-tree) Index on "time" column on the files
Read from Archive Storage if merged. Otherwise, from Realtime Storage
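The lookup that the range index answers ("which files overlap this time window?") and the archive-or-realtime rule can be sketched in plain Python. This is an assumption-laden illustration: ranges are treated as half-open `[start, end)`, the `merged` flag and the function names are hypothetical, and a linear scan stands in for the GiST index.

```python
from datetime import datetime


def overlaps(a, b):
    """True if half-open ranges a=[a0, a1) and b=[b0, b1) share any instant."""
    return a[0] < b[1] and b[0] < a[1]


def files_to_read(archive_files, realtime_files, query_range):
    """Pick archive partitions, plus realtime files that are not merged yet."""
    chosen = [f["name"] for f in archive_files if overlaps(f["range"], query_range)]
    chosen += [
        f["name"]
        for f in realtime_files
        if not f["merged"] and overlaps(f["range"], query_range)
    ]
    return chosen


def ts(hour, minute=0, second=0):
    return datetime(2015, 12, 1, hour, minute, second)


archive_files = [{"name": "archive-10", "range": (ts(10), ts(11))}]
realtime_files = [
    {"name": "rt-1", "range": (ts(10, 47, 11), ts(10, 48, 13)), "merged": True},
    {"name": "rt-2", "range": (ts(11, 9, 32), ts(11, 10, 35)), "merged": False},
    {"name": "rt-3", "range": (ts(11, 38, 43), ts(11, 40, 49)), "merged": False},
]
selected = files_to_read(archive_files, realtime_files, (ts(10, 30), ts(11, 15)))
```

Skipping merged realtime files avoids reading the same records twice, once from each storage tier.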
19. Data Importing
> Scalable & Reliable importing
> Fluentd buffers data on disk
> Import queue deduplicates uploaded chunks
> Workers take the chunks and put them into Realtime Storage
> Instant visibility
> Imported data is immediately visible to query engines.
> Background workers merge the files every hour.
> Metadata
> Index is built on PostgreSQL using the RANGE type and a GiST index
21. time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …
time code method
2015-12-01 11:10:09 200 GET
2015-12-01 11:21:45 200 GET
2015-12-01 11:38:59 200 GET
2015-12-01 11:43:37 200 GET
2015-12-01 11:54:52 “200” GET
… … …
Archive Storage
Files on Amazon S3 / Basho Riak CS
Metadata on PostgreSQL
path  index range                                 records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]  3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]  2,143
…     …                                           …
MessagePack Columnar File Format
22. time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …
time code method
2015-12-01 11:10:09 200 GET
2015-12-01 11:21:45 200 GET
2015-12-01 11:38:59 200 GET
2015-12-01 11:43:37 200 GET
2015-12-01 11:54:52 “200” GET
… … …
Archive Storage
path  index range                                 records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]  3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]  2,143
…     …                                           …
column-based partitioning
time-based partitioning
Files on Amazon S3 / Basho Riak CS
Metadata on PostgreSQL
23. time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …
time code method
2015-12-01 11:10:09 200 GET
2015-12-01 11:21:45 200 GET
2015-12-01 11:38:59 200 GET
2015-12-01 11:43:37 200 GET
2015-12-01 11:54:52 “200” GET
… … …
Archive Storage
path  index range                                 records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]  3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]  2,143
…     …                                           …
column-based partitioning
time-based partitioning
Files on Amazon S3 / Basho Riak CS
Metadata on PostgreSQL
SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code
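How time-based and column-based partitioning reduce the work for a query like this one can be sketched with a tiny in-memory layout. The data is the first four rows of each table on the slide; the dictionary layout, the function `count_by_code` and the half-open treatment of the index ranges are assumptions made for illustration.

```python
from collections import Counter

# Each partition stores one list per column (column-based partitioning)
# and is labelled with the time range it covers (time-based partitioning).
PARTITIONS = [
    {
        "index_range": ("2015-12-01 10:00:00", "2015-12-01 11:00:00"),
        "columns": {
            "time": ["2015-12-01 10:02:36", "2015-12-01 10:22:09",
                     "2015-12-01 10:36:45", "2015-12-01 10:49:21"],
            "code": [200, 404, 200, 200],
            "method": ["GET", "GET", "GET", "POST"],
        },
    },
    {
        "index_range": ("2015-12-01 11:00:00", "2015-12-01 12:00:00"),
        "columns": {
            "time": ["2015-12-01 11:10:09", "2015-12-01 11:21:45",
                     "2015-12-01 11:38:59", "2015-12-01 11:43:37"],
            "code": [200, 200, 200, 200],
            "method": ["GET", "GET", "GET", "GET"],
        },
    },
]


def count_by_code(partitions, since):
    """Count rows per code with time >= since; also report partitions scanned."""
    counts = Counter()
    scanned = 0
    for partition in partitions:
        _, end = partition["index_range"]
        if end <= since:
            continue  # time-based pruning: the whole file is older than the cutoff
        scanned += 1
        # Column-based pruning: only "time" and "code" are read, never "method".
        times = partition["columns"]["time"]
        codes = partition["columns"]["code"]
        for row_time, code in zip(times, codes):
            if row_time >= since:  # ISO-formatted strings compare chronologically
                counts[code] += 1
    return dict(counts), scanned


result, scanned = count_by_code(PARTITIONS, "2015-12-01 11:00:00")
```

Only the second partition is opened, and within it only two of the three columns are touched.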
24. Handling Eventual Consistency
1. Writing data / metadata first
> At this time, data is not visible
2. Check whether data is available
> GET, GET, GET…
3. Data becomes visible
> Query includes imported data!
Ex. Netflix case
> https://github.com/Netflix/s3mper
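The three steps above can be sketched as a write-then-poll loop. Everything here is a toy: `EventuallyConsistentStore` simulates an object store whose objects only become readable after a few GETs, and the `visible_files` set stands in for whatever metadata flag exposes a file to queries. None of these names come from the slides or from s3mper.

```python
import time


class EventuallyConsistentStore:
    """Toy object store: a new object is readable only after `lag_reads` GETs."""

    def __init__(self, lag_reads: int):
        self._objects = {}
        self._reads = {}
        self._lag_reads = lag_reads

    def put(self, key, data):
        self._objects[key] = data
        self._reads[key] = 0

    def get(self, key):
        if key not in self._objects:
            return None
        self._reads[key] += 1
        if self._reads[key] <= self._lag_reads:
            return None  # written, but not yet visible to readers
        return self._objects[key]


def import_file(store, visible_files, key, data, max_attempts=10, delay=0.0):
    store.put(key, data)                      # 1. write first; queries cannot see it yet
    for attempt in range(1, max_attempts + 1):
        if store.get(key) is not None:        # 2. GET, GET, GET...
            visible_files.add(key)            # 3. only now expose the file to queries
            return attempt
        time.sleep(delay)
    raise TimeoutError(f"{key} did not become readable after {max_attempts} attempts")


store = EventuallyConsistentStore(lag_reads=2)
visible_files = set()
attempts = import_file(store, visible_files, "file-1", b"records")
```

Queries consult only `visible_files`, so they never reference an object that the store cannot serve yet.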
25. Hide network cost
> Open a lot of connections to Object Storage
> Using range feature with columnar offset
> Improve scan performance for partitioned data
> Detect recoverable errors
> We have error lists for fault tolerance
> Stall checker
> Watch the progress of reading data
> If processing time reaches the threshold, reconnect to Object Storage and re-read data
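A stall checker combined with ranged reads might look roughly like the following. This is a simulation under assumptions: `FakeObjectStorage` and its `read_range` method are hypothetical stand-ins for a real client, and a stalled read is modelled as a `TimeoutError` raised once the threshold passes.

```python
class FakeObjectStorage:
    """Hypothetical storage client whose first `stalls` reads make no progress."""

    def __init__(self, blob: bytes, stalls: int):
        self.blob = blob
        self.stalls = stalls
        self.connections = 0

    def connect(self):
        self.connections += 1
        return self

    def read_range(self, offset: int, length: int, timeout: float) -> bytes:
        if self.stalls > 0:
            self.stalls -= 1
            raise TimeoutError(f"no progress within {timeout:.1f}s")
        return self.blob[offset:offset + length]


def read_with_stall_check(storage, total, chunk=4, timeout=1.0, max_reconnects=3):
    """Read `total` bytes in ranged chunks, reconnecting whenever a read stalls."""
    conn = storage.connect()
    data = bytearray()
    reconnects = 0
    while len(data) < total:
        try:
            # Ranged read: resume from the bytes already received.
            data += conn.read_range(len(data), min(chunk, total - len(data)), timeout)
        except TimeoutError:
            if reconnects == max_reconnects:
                raise
            reconnects += 1
            conn = storage.connect()  # drop the stalled connection and retry the range
    return bytes(data)


storage = FakeObjectStorage(b"0123456789", stalls=2)
payload = read_with_stall_check(storage, total=10)
```

Because each read names its own offset, a reconnect costs only the chunk that stalled, not the whole object.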
27. Recoverable errors
> Error types
> User error
> Syntax error, Semantic error
> Insufficient resource
> Exceeded task memory size
> Internal failure
> I/O error of S3 / Riak CS
> worker failure
> etc
We can retry these patterns
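Classifying errors before retrying can be sketched as below. The slide does not spell out which of the listed patterns are retried, so this sketch assumes that only internal failures are; the exception classes and `run_with_retry` are hypothetical names.

```python
class UserError(Exception):
    """Syntax or semantic error: retrying cannot help."""


class InsufficientResource(Exception):
    """For example, exceeded task memory size."""


class InternalFailure(Exception):
    """For example, a storage I/O error or a worker failure."""


# Assumption for this sketch: only internal failures are treated as recoverable.
RETRYABLE = (InternalFailure,)


def run_with_retry(task, max_attempts=3):
    """Run `task`, retrying only when it raises a recoverable error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RETRYABLE:
            if attempt == max_attempts:
                raise


calls = {"flaky": 0, "bad_sql": 0}


def flaky_task():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise InternalFailure("I/O error")
    return "ok"


def bad_sql_task():
    calls["bad_sql"] += 1
    raise UserError("syntax error")


outcome = run_with_retry(flaky_task)
try:
    run_with_retry(bad_sql_task)
    user_error_raised = False
except UserError:
    user_error_raised = True
```

The user error surfaces on the first attempt, while the transient failure is absorbed and the query succeeds eventually.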
29. Presto retry on Internal Errors
> Queries succeed eventually
(chart, log scale)
30. time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …
user time code method
391 2015-12-01 11:10:09 200 GET
482 2015-12-01 11:21:45 200 GET
573 2015-12-01 11:38:59 200 GET
664 2015-12-01 11:43:37 200 GET
755 2015-12-01 11:54:52 “200” GET
… … …
31. time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …
user time code method
391 2015-12-01 11:10:09 200 GET
482 2015-12-01 11:21:45 200 GET
573 2015-12-01 11:38:59 200 GET
664 2015-12-01 11:43:37 200 GET
755 2015-12-01 11:54:52 “200” GET
… … …
MessagePack Columnar File Format is schema-less
✓ Instant schema change
SQL is schema-full
✓ SQL doesn’t work without schema
Schema-on-Read
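Schema-on-read can be illustrated with the rows from the tables above: records are stored as-is, and a schema is applied only when they are read. This is a minimal sketch; the `SCHEMA` mapping, the function `read_with_schema` and the choice to return `None` for missing or uncastable values are assumptions, not the platform's documented behaviour.

```python
# Stored rows are schema-less: columns can appear ("user") and types can drift ("200").
ROWS = [
    {"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"},
    {"user": 755, "time": "2015-12-01 11:54:52", "code": "200", "method": "GET"},
]

# The schema lives outside the files and is applied at query time.
SCHEMA = {"user": int, "time": str, "code": int, "method": str}


def read_with_schema(row, schema):
    """Project a stored row onto the schema, casting values as they are read."""
    typed = {}
    for column, cast in schema.items():
        value = row.get(column)
        if value is None:
            typed[column] = None  # column absent in this row
            continue
        try:
            typed[column] = cast(value)
        except (TypeError, ValueError):
            typed[column] = None  # value cannot be read as the declared type
    return typed


typed_rows = [read_with_schema(row, SCHEMA) for row in ROWS]
```

Changing the schema never rewrites stored files; the older row simply reads as having no `user`, and the string "200" is read as the integer 200.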
37. Hadoop
> Distributed computing framework
> Consists of many components…
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
38. Presto
> A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
> Open sourced by Facebook
> https://github.com/facebook/presto
39. Conclusion
> Build a scalable data analytics platform on the cloud
> Separate compute resources from storage
> Loosely-coupled components
> We have lots of useful OSS and services :)
> There are many trade-offs
> Use an existing component or create a new one?
> Stick to the basics!
> If you are tired, please use Treasure Data ;)