Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between big data and small data for data scientists, and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first tool data scientists reach for to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.
When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.
Through live demonstrations and code samples, you will understand:
- how to effectively leverage both pandas and Spark inside the same code base
- how to leverage powerful pandas concepts such as lightweight indexing with Spark
- technical considerations for unifying the different behaviors of Spark and pandas
This document discusses Fluentd and its webhdfs output plugin. It explains how the webhdfs plugin was created in 30 minutes by leveraging existing Ruby gems for WebHDFS operations and output formatting. The document concludes that output plugins can reuse code from mixins and that developing shared mixins allows plugins to incorporate common features more easily.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook (The Hive)
This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.
DB2 is a database manager that runs on Linux, Unix, and Windows operating systems. It allows users to catalog databases, start and stop instances, and configure parameters. Key commands for managing DB2 include db2icrt for creating instances, db2idrop for dropping instances, db2ilist for listing instances, and db2set for setting configuration parameters at the global, instance, and node level. The db2set command provides centralized control over environment variables.
Zigbee is a wireless technology standard used for low-power wireless networks that are commonly used in home automation and industrial control applications. It operates on open global standards that define reliable, cost-effective networks. Zigbee networks employ a mesh topology that allows for robust coverage and self-healing capabilities. The technology supports low data rates, short range, and long battery life, making it suitable for applications like wireless light switches, electrical meters, and other IoT devices.
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
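The narrow/wide distinction in the bullets above is what lets the scheduler pipeline narrow operations within a partition. A rough sketch of the idea (toy Python lists standing in for partitions; this is an illustration, not Spark's implementation):

```python
# A plain-Python sketch of the narrow/wide distinction (an illustration,
# not Spark's implementation). Toy lists stand in for an RDD's partitions.
from collections import defaultdict

partitions = [[1, 2, 3], [4, 5, 6]]

# Narrow transformation: map runs on each partition independently,
# so no data moves between partitions.
mapped = [[x * 2 for x in part] for part in partitions]

# Wide transformation: grouping by key needs a shuffle that moves every
# record to the partition owning its key before grouping can happen.
def shuffle_by_key(parts, num_partitions=2):
    out = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            out[hash(key) % num_partitions][key].append(value)
    return [dict(d) for d in out]

pairs = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]
grouped = shuffle_by_key(pairs)
```

Because map never crosses partition boundaries, consecutive narrow transformations can be fused into one pass; the shuffle in `shuffle_by_key` is the barrier that ends such a pipeline.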
This document summarizes a presentation about log forwarding at scale. It discusses how logging works internally and requires understanding the logging pipeline of parsing, filtering, buffering and routing logs. It then introduces Fluent Bit as a lightweight log forwarder that can be used to cheaply forward logs from edge nodes to log aggregators in a scalable way, especially in cloud native environments like Kubernetes. Hands-on demos show how Fluent Bit can parse and add metadata to Kubernetes logs.
A Kafka-based platform to process medical prescriptions of Germany’s health i... (HostedbyConfluent)
With the beginning of 2022, the German government introduced an electronic prescription format for medications, comprising everything from the prescription through the pharmacy dispense to the invoice. It is described by FHIR – the global standard for exchanging electronic health data. Together with spectrumK, a service company for Germany's health insurers, we have built a platform on top of Apache Kafka and Kafka Streams that can process and approve prescriptions at large scale.
In this session, we present different aspects of the platform. We highlight the benefits of our approach of converting the complex FHIR schemas to Protobuf, compared to working directly with data in the FHIR format. We further showcase how we use Kafka Streams to integrate a multitude of sources and build complex profiles of master data. These profiles are then exposed through an interplay of Kafka, Protobuf and GraphQL and are, among other things, requested during the approval process. This complex process involves a variety of microservices. We explain how we have developed asynchronous and synchronous modes for the process, so that the platform can support orthogonal requirements. Finally, we share our learnings on how to auto-scale such a platform.
The pluggable database was created from an XML file that had Oracle Spatial enabled, but the container database no longer had Spatial enabled. This caused the pluggable database to open in restricted session mode. The mismatch between the pluggable database and container database configurations for the Oracle Spatial feature can be resolved by either recreating the XML file without Spatial or by connecting to the pluggable database in restricted mode and disabling Spatial within the pluggable database.
Integrating Apache Kafka and Elastic Using the Connect Framework (Confluent)
As a streaming platform, Apache Kafka provides low-latency, high-throughput, fault-tolerant publish and subscribe pipelines and excels at processing streams of real-time events. Kafka provides reliable, millisecond delivery for connecting downstream systems with real-time data.
In this talk, we will show how easy it is to leverage Kafka and the Elasticsearch connector to keep your indices populated with the latest data from the rest of your enterprise, as it changes.
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB (YugabyteDB)
Slides from the webinar "Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB" by Amey Banarse, Principal Data Architect at Yugabyte, recorded on Oct 30, 2019 at 11 AM Pacific.
Playback here: https://vimeo.com/369929255
This document defines and describes different types of firewalls. It begins by defining a firewall as a hardware or software network security system that controls incoming and outgoing network traffic using rules. It notes that the first firewall was invented by William Cheswick, Steven Bellovin, and others. The document then describes three main types of firewalls - packet filtering, application proxy, and hybrid - and provides details on their workings, advantages, and disadvantages. It concludes by stating that while firewalls help protect systems, they are not fully secure on their own.
Slides of SNMP (Simple Network Management Protocol) (Shahrukh Ali Khan)
SNMP is a protocol for remotely managing and monitoring devices on a network. It allows network administrators to manage nodes by retrieving and setting configuration parameters like IP addresses, counters, and software versions. SNMP uses agents running on devices which collect data and send it to managing nodes upon request or autonomously via traps. It operates over UDP on ports 161 and 162 using a simple request-response model.
This document provides an overview of Apache Kafka. It begins with defining Kafka as a distributed streaming platform and messaging system. It then lists the agenda which includes what Kafka is, why it is used, common use cases, major companies that use it, how it achieves high performance, and core concepts. Core concepts explained include topics, partitions, brokers, replication, leaders, and producers and consumers. The document also provides examples to illustrate these concepts.
This document provides tips and best practices for using Amazon DynamoDB. It discusses using indexes like local secondary indexes (LSI) and global secondary indexes (GSI) as well as scaling DynamoDB. It also covers data modeling patterns for different types of relationships and using DynamoDB for scenarios like storing time series data and building catalogs. The document provides two case studies for using DynamoDB with other AWS services like S3, Lambda, Elasticsearch and EMR/Hive to enable big data analytics.
Apache Kafka is a high-throughput distributed messaging system that allows for both streaming and offline log processing. It uses Apache Zookeeper for coordination and supports activity stream processing and real-time pub/sub messaging. Kafka bridges the gaps between pure offline log processing and traditional messaging systems by providing features like batching, transactions, persistence, and support for multiple consumers.
This document provides an overview of SQL and NoSQL databases. It defines SQL as a language used to communicate with relational databases, allowing users to query, manipulate, and retrieve data. NoSQL databases are defined as non-relational and allow for flexible schemas. The document compares key aspects of SQL and NoSQL such as data structure, querying, scalability and provides examples of popular SQL and NoSQL database systems. It concludes that both SQL and NoSQL databases will continue to be important with polyglot persistence, using the best database for each storage need.
RocksDB is an embedded key-value store written in C++ and optimized for fast storage environments like flash or RAM. It uses a log-structured merge tree to store data by writing new data sequentially to an in-memory log and memtable, periodically flushing the memtable to disk in sorted SSTables. It reads from the memtable and SSTables, and performs background compaction to merge SSTables and remove overwritten data. RocksDB supports two compaction styles - level style, which stores SSTables in multiple levels sorted by age, and universal style, which stores all SSTables in level 0 sorted by time.
The document discusses LSM-trees, which are data structures used in Cassandra and other databases. An LSM-tree improves write performance by storing data in an in-memory tree and writing large batches to disk. It supports fast reads by merging data from memory and disk. The document provides examples of how Cassandra uses LSM-trees to handle writes, reads, and compaction of SSTables from memory to disk.
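That write/read/compaction cycle can be sketched in a few lines of Python (a toy illustration of the idea only, not Cassandra's or RocksDB's actual code):

```python
# A minimal LSM-tree sketch: writes land in an in-memory memtable; when it
# fills, it is flushed as an immutable sorted run (an "SSTable"); reads check
# the memtable first, then runs from newest to oldest; compaction merges runs
# and drops overwritten values.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []          # list of sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: persist the memtable as an immutable sorted run
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # merge all runs; newer values overwrite older ones
        merged = {}
        for run in self.sstables:
            merged.update(run)
        self.sstables = [sorted(merged.items())]
```

Writes are fast because they only touch memory and append sorted batches; reads must merge memory and disk, which is why real systems bound the number of runs through compaction.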
An overview of the Amazon ElastiCache managed service, with examples of how it can be used to increase performance, lower costs and augment other database services and databases to make things faster, easier and less expensive.
(Jason Gustafson, Confluent) Kafka Summit SF 2018
Kafka has a well-designed replication protocol, but over the years, we have found some extremely subtle edge cases which can, in the worst case, lead to data loss. We fixed the cases we were aware of in version 0.11.0.0, but shortly after that, another edge case popped up and then another. Clearly we needed a better approach to verify the correctness of the protocol. What we found is Leslie Lamport’s specification language TLA+.
In this talk I will discuss how we have stepped up our testing methodology in Apache Kafka to include formal specification and model checking using TLA+. I will cover the following:
1. How Kafka replication works
2. What weaknesses we have found over the years
3. How these problems have been fixed
4. How we have used TLA+ to verify the fixed protocol.
This talk will give you a deeper understanding of Kafka replication internals and its semantics. The replication protocol is a great case study in the complex behavior of distributed systems. By studying the faults and how they were fixed, you will have more insight into the kinds of problems that may lurk in your own designs. You will also learn a little bit of TLA+ and how it can be used to verify distributed algorithms.
In this tutorial, we cover the different deployment possibilities of the MySQL architecture depending on the business requirements for the data. We also deploy some of these architectures and see how to evolve from one to the next.
The tutorial covers the new MySQL Solutions like InnoDB ReplicaSet, InnoDB Cluster, and InnoDB ClusterSet.
Custom DevOps Monitoring System in MelOn (with InfluxDB + Telegraf + Grafana) (Seungmin Yu)
Slides presented at the 2016 데이터야놀자 (Dataya Nolja) conference.
A case study of building and operating a monitoring system at MelOn with the InfluxDB + Telegraf + Grafana stack, along with thoughts on the various metric data collected and its value from a DevOps perspective.
InfiniFlux DBMS vs InfluxDB DBMS
Storage and Retrieval Performance Test Results
Technical Note
2016-12-05
INFINIFLUX
www.infiniflux.com
Contents
1. Overview
   1.1 H/W Environment
   1.2 S/W Environment
   1.3 Test Data
      1.3.1 InfiniFlux table
      1.3.2 InfluxDB table
   1.4 Data Files
2. Performance Tests
   2.1 Data Loading
      2.1.1 InfiniFlux data loading
      2.1.2 InfluxDB data loading
      2.1.3 InfiniFlux vs InfluxDB
   2.2 Disk Storage Usage
   2.3 System Resource Usage
      2.3.1 CPU utilization
      2.3.2 Memory usage
   2.4 Query Tests
1. Overview
1.1 H/W Environment
CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz * 8 CORE
MEM: 32GB
DISK: HDD 3.6 TB * 5
1.2 S/W Environment
OS: CentOS 6.7
DB: InfiniFlux 3.1.1 vs InfluxDB 1.0.2
1.3 Test Data
The data and tables that InfiniFlux uses for comparative tests against other products were used.
The data was generated from typical web logs with additional columns added.
Each record consists of a timestamp, IPs, ports, and event content.
InfiniFlux sample data
2015-05-20 06:00:00,219.229.142.218,2762,7.234.88.67,593,62,GET /twiki/bin/view/Main/MikeMannix HTTP/1.1,200,3686
2015-05-20 06:00:11,100.46.183.122,11989,227.106.13.91,4709,50,GET /mailman/listinfo/administration HTTP/1.1,200,6459
2015-05-20 06:00:11,214.153.107.182,7586,5.114.66.53,5213,6,GET /twiki/bin/view/Main/SpamAssassin HTTP/1.1,200,4081
InfluxDB sample data
The values and composition of the sample data are identical, but the input format differs, so the records were reformatted.
The leading test_influxdb is the table (measurement) name; after it, separated by spaces, come the tag keys and then the field keys.
The last value is the date, given as a Unix timestamp. The time format used on input can be set with the -precision rfc3339/n/ns/m option.
Each data value must be preceded by its column name.
test_influxdb,srcip=123.198.82.192 dstip="50.230.44.173",srcport=9978,dstport=782,protocol=16,eventlog="GET /twiki/bin/view/TWiki/KlausWriessnegger HTTP/1.1",evencod=200,eventsize=3848 1479265350401897972
test_influxdb,srcip=58.208.78.121 dstip="231.146.69.51",srcport=2330,dstport=3082,protocol=46,eventlog="GET /twiki/bin/view/TWiki/ManagingWebs?rev=1.22 HTTP/1.1",evencod=200,eventsize=9310 1479265197208287210
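The record layout just described can be captured in a small formatting helper. The function below is a hypothetical illustration (not part of InfluxDB or its client libraries), and it omits the escaping of special characters that the real line protocol requires:

```python
# Hypothetical helper rendering one record in InfluxDB's line protocol:
#   <measurement>,<tag set> <field set> <timestamp>
# Tags are unquoted key=value pairs, string field values are double-quoted,
# and the timestamp is in Unix nanoseconds (matching -precision=ns).
# Escaping of commas/spaces inside tag values is omitted for brevity.

def to_line_protocol(measurement, tags, fields, ts_ns):
    tag_set = ",".join(f"{k}={v}" for k, v in tags.items())
    field_parts = []
    for k, v in fields.items():
        if isinstance(v, str):
            field_parts.append(f'{k}="{v}"')   # string fields are quoted
        else:
            field_parts.append(f"{k}={v}")     # numeric fields are not
    return f"{measurement},{tag_set} {','.join(field_parts)} {ts_ns}"

# Rebuild the first sample record from the data above
line = to_line_protocol(
    "test_influxdb",
    {"srcip": "123.198.82.192"},
    {"dstip": "50.230.44.173", "srcport": 9978, "dstport": 782, "protocol": 16,
     "eventlog": "GET /twiki/bin/view/TWiki/KlausWriessnegger HTTP/1.1",
     "evencod": 200, "eventsize": 3848},
    1479265350401897972,
)
print(line)
```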
1.3.1 InfiniFlux table
Table creation
create table sampletable
(
    AT1 datetime property (MINMAX_CACHE_SIZE=1048576) not null,
    SRCIP ipv4,
    SCRPORT integer,
    DSTIP ipv4,
    DSTPORT integer,
    PROTOCOL short,
    eventlog varchar(1024),
    eventcode short,
    eventsize long
);
Index creation
create index idx2 on sampletable(srcip) index_type lsm max_level=3;
create index idx3 on sampletable(dstip) index_type lsm max_level=3;
create keyword index idx4 on sampletable(eventlog) index_type lsm max_level=3;
- Each table column is created with a type matching its data
- The datetime column is configured with a minmax cache
- Indexes are created for conditional searches
1.3.2 InfluxDB table
Table
InfluxDB has tag keys and field keys, and only tag keys are indexed.
Below is the table's key listing printed from InfluxDB after the data was loaded:
name: test_influxdb
-------------------
tagKey
srcip
evencod
name: test_influxdb
fieldKey fieldType
dstip string
dstport float
eventlog string
eventsize float
protocol float
srcport float
- Column types are set automatically to match the input data
- srcip is loaded as a tag key (so that it can be used in GROUP BY)
- The sequential date values do not appear as a table column after loading, but are stored as each record's input timestamp
1.4 Data Files
InfiniFlux sample data (100 million records, 13GB)
-rw-rw-r--. 1 demo demo 13045863917 May 20 11:15 sampletable100m.csv
InfluxDB sample data (100 million records, 20GB)
A new file was generated because of the different input format
The format difference accounts for the difference in file size
-rw-rw-r-- 1 perf perf 21253641580 Nov 16 13:37 sampletable100m.csv
2. Performance Tests
2.1 Data Loading
2.1.1 InfiniFlux data loading
Data was loaded into InfiniFlux using the ifluxloader tool provided with InfiniFlux, which loads a CSV file into a table.
File | Target table | Records | Size | Time | EPS
sampletable100m.csv | sampletable | 100 million | 13GB | 4.96 min | 336,022
- Across the full 100 million records, about 330,000 records were loaded per second.
2.1.2 InfluxDB data loading
Data was loaded into InfluxDB from a text data file using the command influx -import -path=sampletable100m.csv -precision=ns.
File | Target table | Records | Size | Time | EPS
sampletable100m.csv | test_influxdb | 100 million | 20GB | 48.18 min | 39,117
- Across the full 100 million records, about 39,000 records were loaded per second.
2.1.3 InfiniFlux vs InfluxDB
- In this test, InfiniFlux loaded data roughly 10x faster than InfluxDB.
- InfluxDB's load speed was observed to drop under sustained continuous loading.
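The EPS column is simply the record count divided by the elapsed time in seconds; a quick check of the reported InfiniFlux figure:

```python
# Sanity check of the reported EPS: 100 million records in 4.96 minutes.
records = 100_000_000
load_minutes = 4.96
eps = records / (load_minutes * 60)
print(round(eps))  # 336022, matching the reported InfiniFlux figure
```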
[Bar chart: load speed (records per second): InfluxDB 39,117 vs InfiniFlux 336,022]
2.2 Disk Storage Usage
Because InfiniFlux compresses data as it stores it, the stored data is smaller than the original.
- For the 13GB data set, InfiniFlux used 5GB of storage; for the 20GB data set, InfluxDB used 20GB.
- InfiniFlux compressed the data by 62% relative to the original; InfluxDB achieved 0% compression.
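The 62% figure is the reduction of the stored size relative to the original file, which is easy to verify:

```python
# InfiniFlux stored the 13GB CSV in 5GB; reduction relative to the original.
original_gb = 13
stored_gb = 5
reduction = 1 - stored_gb / original_gb
print(f"{reduction:.0%}")  # 62%
```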
2.3 System resource usage
System resources were sampled with dstat at 5-second intervals and logged.
Measurements reflect total system usage, not individual processes.
2.3.1 CPU usage
Influxdb averaged about 35% CPU during data loading.
InfiniFlux averaged about 24% CPU.
[Chart: disk storage usage (GB): Influxdb 20GB vs InfiniFlux 5GB]
[Chart: CPU usage (%) over time during loading: Influxdb vs InfiniFlux]
2.3.2 Memory usage
- Influxdb used an average of 1.4GB of memory during loading, and that memory was not released until Influxdb was shut down.
- InfiniFlux used an average of 1.5GB, with memory usage varying dynamically with the workload.
2.4 Query tests
In Influxdb, GROUP BY works only on tag keys, so the tag keys must be chosen when the data is loaded.
In this test the tag keys were set on the srcip and evencod columns, so GROUP BY works only on those columns.
When grouping, the tag values are printed automatically even if they are not listed in the SELECT clause.
The timestamps inserted into Influxdb are arbitrarily generated Unix nanosecond values.
Full count
select count(*) from test_cli..test_influxdb
Simple conditional search
select count(*) from test_cli..test_influxdb
where (srcip='31.224.72.52' and dstip='86.45.186.17');
[Chart: memory usage over time during loading: Influxdb vs InfiniFlux]
Complex conditional query
select count(dstip), min(dstport), max(dstport), mean("eventsize")
from test_cli..test_influxdb
where (dstport <= 1000 and dstport >= 100) and evencod=200
group by srcip
GROUP BY average query
select mean(eventsize) from test_cli..test_influxdb group by evencod
GROUP BY count query
select count(*) from test_cli..test_influxdb group by evencod
GROUP BY sum query
select sum(eventsize) from test_cli..test_influxdb group by evencod
GROUP BY stddev query
select stddev(eventsize) from test_cli..test_influxdb group by evencod
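What the GROUP BY queries compute can be sketched in plain Python. The rows below are hypothetical stand-ins, but the column names match the test table:

```python
# Hypothetical rows standing in for test_influxdb data.
rows = [
    {"evencod": 200, "eventsize": 100.0},
    {"evencod": 200, "eventsize": 300.0},
    {"evencod": 401, "eventsize": 50.0},
]

# Equivalent of: select mean(eventsize) from ... group by evencod
groups = {}
for r in rows:
    groups.setdefault(r["evencod"], []).append(r["eventsize"])
means = {k: sum(v) / len(v) for k, v in groups.items()}
print(means)  # → {200: 200.0, 401: 50.0}
```

The count, sum, and stddev variants replace only the final aggregation over each group's list.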
Results

Query              InfiniFlux  Influxdb    Result (count / value)
Full count         0.000 s     34.957 s    100M rows
Simple condition   0.214 s     4.245 s     17,640
GROUP BY AVERAGE   11.076 s    8.36 s *    evencod 401: 12848, evencod 200: 9599.9949
GROUP BY COUNT     7.738 s     40.673 s    evencod 401: 19998781, evencod 200: 80001219
GROUP BY SUM       11.838 s    6.49 s *    evencod 401: 256944328701, evencod 200: 768010626424
GROUP BY STDDEV    14.869 s    303.034 s   evencod 401: 2.44939, evencod 200: 12544.2
Complex condition  30.852 s    52.126 s    75 rows

* Influxdb pre-computes these aggregates, so it is faster only in the full-scan case.
3. Overall test assessment
Summary of InfiniFlux vs Influxdb test results

Item               InfiniFlux   Influxdb   Notes
Source data        sampletable100m.csv (100M rows, 13GB); test sample data
CPU usage (%)      24%          35%        100% = all 8 cores; InfiniFlux uses 1.5x less
Memory usage (GB)  1.5GB        1.4GB      InfiniFlux uses 1.1x more
Disk usage (GB)    5GB          20GB       vs original: InfiniFlux 62% compression, Influxdb 0%
Insert rate (EPS)  336,022 EPS  39,117 EPS 9x faster
Queries
  Full count       0.000 s      34.957 s   100M rows
  Time search      0.214 s      4.245 s    ~17,000 rows
  Complex search   30.852 s     52.126 s   75 rows
GROUP BY AVERAGE   11.076 s     8.36 s     401: 12848, 200: 9599.99
GROUP BY COUNT     7.738 s      40.673 s   401: 19998781, 200: 80001219
GROUP BY SUM       11.83 s      6.49 s     401: 256944328701, 200: 768010626424
GROUP BY STDDEV    14.869 s     303.034 s  401: 2.44939, 200: 12544.2
- InfiniFlux and Influxdb are similar in that both use SQL-style statements, but their internal designs differ in many ways.
- Both create a database space inside the DB and store data in table form within it.
- Tables within a database can be searched together at once.
- InfiniFlux uses a column-oriented DB as its base architecture, while Influxdb uses key-value storage, so the two DBs differ considerably in performance and usage depending on the case.
- Because of this difference in storage, InfiniFlux requires creating a table schema and loading data to match that structure, like a conventional DB, whereas Influxdb can store data in only a subset of the columns.
- Due to the same storage design, Influxdb's GROUP BY cannot be used on columns without a tag key, and ORDER BY and LIMIT are restricted.
- Keyword searches such as LIKE are also impossible (only string '=' equality search is supported).
- Influxdb query speed slows as the number of result rows grows: equality conditions respond quickly, but searches on columns without a tag key were observed to take much longer.
- On insert, when a point has the same tag keys and the same timestamp (actual or arbitrary) as an existing point in the table, it is treated as a duplicate and is not inserted.
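The duplicate behavior observed above can be modeled as the point's identity being its measurement, tag set, and timestamp. A minimal sketch (column names from the test, values invented):

```python
# A point's identity: (measurement, tag set, timestamp).
# The test observed that a second point with the same identity is not inserted.
seen = {}
points = [
    ("test_influxdb", (("srcip", "31.224.72.52"),), 1447657020000000000, {"eventsize": 10.0}),
    ("test_influxdb", (("srcip", "31.224.72.52"),), 1447657020000000000, {"eventsize": 99.0}),
]
for measurement, tags, ts, fields in points:
    seen.setdefault((measurement, tags, ts), fields)  # duplicate identity: skipped
print(len(seen))  # → 1
```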
- By default, each table is limited to storing 1,000,000 rows; editing the conf file allows more rows to be loaded.
- The more tag keys are set, the slower loading becomes and the larger the storage footprint grows.
This test compared the performance of InfiniFlux and Influxdb.
The results show that analyzing data with Influxdb requires table layouts and designs quite different from InfiniFlux. Influxdb is oriented toward holding many tables and running conditional searches through queries: the design should set tag keys per table and produce statistics by grouping on those tag keys.
Because the default tables have an insert row limit, problems appear to arise when a large amount of data concentrates in one table even after partitioning. When a tag-key column has low duplication (many distinct values) and the data volume is large, memory usage rises and the insert rate slows considerably. Also, since indexing is centered on tag keys, other columns have no index, so searches on those columns alone were observed to take longer.
InfiniFlux, on the other hand, supports fast loading and a wide range of statistical queries, and with its easy index setup it can be used for many purposes that require fast data analysis.