Some tools work better for some jobs than others: MongoDB is great for low-latency access to recent data, while Treasure Data provides an effectively infinitely scalable store for historical data. A lambda architecture combining the two is also explained.
How to get the best of both: MongoDB offers quick, low-latency access to recent data, while Treasure Data provides an ever-growing store of historical data whose scaling you never have to worry about.
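To make that split concrete, here is a minimal Python sketch of the read side of such a setup: recent events are served from MongoDB, while historical aggregates are pulled from Treasure Data via Presto. It assumes the pymongo and td-client (tdclient) packages; the database, table, and query names are placeholders.

```python
import os
from datetime import datetime, timedelta

import tdclient
from pymongo import MongoClient

# Hot path: the last 24 hours of events straight from MongoDB (low latency).
mongo = MongoClient("mongodb://localhost:27017")
recent = list(
    mongo["analytics"]["events"]
    .find({"time": {"$gte": datetime.utcnow() - timedelta(hours=24)}})
    .limit(100)
)

# Cold path: months of history from Treasure Data (scales without any ops work).
with tdclient.Client(apikey=os.environ["TD_API_KEY"]) as td:
    job = td.query(
        "analytics_db",
        "SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd') AS day, COUNT(1) AS events "
        "FROM events GROUP BY 1 ORDER BY 1",
        type="presto",
    )
    job.wait()
    history = [row for row in job.result()]
```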
1. Eduardo Silva discussed unifying event and log data from multiple sources into the cloud using Fluentd and Fluent Bit.
2. Fluentd is an open source data collector that allows for parsing and storing data from multiple sources through its pluggable input and output plugins.
3. Fluent Bit is designed for collecting data from IoT and embedded devices to transport it to third party services, with a focus on performance and lightweight resource usage.
John Hammink's talk at Great Wide Open 2016. We discuss: 1) the need for data analytics infrastructure that can scale exponentially, 2) what such an infrastructure must contain, and 3) why that infrastructure must be able to handle unstructured and semi-structured data.
Fluentd and Docker - running fluentd within a docker container (Treasure Data, Inc.)
Fluentd is a data collection tool for unified logging that allows for extensible and reliable data collection. It uses a simple core with plugins to provide buffering, high availability, load balancing, and streaming data transfer based on JSON. Fluentd can collect log data from various sources and output to different destinations in a flexible way using its plugin architecture and configuration files. It is widely used in production for tasks like log aggregation, filtering, and forwarding.
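As a concrete illustration of the unified-logging idea, the sketch below sends one structured event to a local Fluentd agent using the fluent-logger Python package; the tag and field names are invented for the example, and the agent is assumed to be listening on the default forward port 24224.

```python
from fluent import sender

# Connect to a local Fluentd agent (in_forward input on the default port).
logger = sender.FluentSender("myapp", host="localhost", port=24224)

# Emit a JSON-like event; Fluentd's routing and output plugins decide where it
# goes (file, MongoDB, Treasure Data, S3, ...), with buffering and retries handled.
if not logger.emit("user.follow", {"from": "alice", "to": "bob"}):
    print(logger.last_error)  # surface delivery problems instead of dropping silently

logger.close()
```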
This document discusses data collection and ingestion tools. It begins with an overview of data collection versus ingestion, with collection happening at the source and ingestion receiving the data. Examples of data collection tools include rsyslog, Scribe, Flume, Logstash, Heka, and Fluentd. Examples of ingestion tools include RabbitMQ, Kafka, and Fluentd. The document concludes with a case study of asynchronous application logging and challenges to consider.
Muga Nishizawa discusses Embulk, an open-source bulk data loader. Embulk loads records from various sources to various targets in parallel using plugins. Treasure Data customers use Embulk to upload different file formats and data sources to their TD database. While Embulk is focused on bulk loading, TD also develops additional tools to generate Embulk configurations, manage loads over time, and scale Embulk using a MapReduce executor on Hadoop clusters for very large data loads.
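For flavor, here is a rough sketch of driving an Embulk bulk load from Python. The config below is a hypothetical minimal file-to-stdout job; in practice `embulk guess` is usually run first to fill in the parser details, and the `out` section would point at a real destination plugin.

```python
import pathlib
import subprocess
import textwrap

# Hypothetical minimal Embulk config: read CSV files, print parsed records to stdout.
config = textwrap.dedent("""\
    in:
      type: file
      path_prefix: ./data/events_
      parser:
        type: csv
        skip_header_lines: 1
        columns:
          - {name: id, type: long}
          - {name: event, type: string}
    out:
      type: stdout
""")
pathlib.Path("config.yml").write_text(config)

# Embulk itself does the parallel, plugin-based loading; swap the `out` section
# for e.g. a Treasure Data or Redshift output plugin in a real pipeline.
subprocess.run(["embulk", "run", "config.yml"], check=True)
```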
This document summarizes Johan Gustavsson's presentation on scaling Hadoop in the cloud. It discusses replacing an on-premise Hadoop cluster with Plazma storage on S3 and job execution in isolated pools. It also covers Treasure Data's Patchset project which aims to support multiple Hadoop versions and allow job-preserving restarts of the Elephant server.
This document discusses migrating data from MySQL to Amazon Redshift. It describes MySQL and Redshift, and some of the challenges of migrating between the two systems, such as incompatible schemas and manual processes. The proposed solution is to use a cloud data lake with schema-on-read to store JSON event data, which can then be loaded into Redshift, a cloud data warehouse with schema-on-write, providing an automated way to migrate data between different systems and schemas.
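A stripped-down version of that flow might look like the Python sketch below: rows are exported from MySQL as newline-delimited JSON into S3 (the schema-on-read lake), then loaded into Redshift with a COPY using JSON 'auto'. Host, bucket, table, and credential names are placeholders, and error handling is omitted.

```python
import json

import boto3
import psycopg2
import pymysql

# 1) Extract from MySQL.
src = pymysql.connect(host="mysql.internal", user="app", password="...", db="prod")
with src.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SELECT * FROM events")
    rows = cur.fetchall()

# 2) Land as newline-delimited JSON in the data lake (schema-on-read).
body = "\n".join(json.dumps(r, default=str) for r in rows)
boto3.client("s3").put_object(
    Bucket="my-data-lake", Key="events/batch1.json", Body=body.encode("utf-8")
)

# 3) Load into Redshift (schema-on-write); JSON 'auto' maps fields to columns.
rs = psycopg2.connect(host="redshift.internal", port=5439, dbname="dw",
                      user="loader", password="...")
with rs, rs.cursor() as cur:
    cur.execute(
        "COPY events FROM 's3://my-data-lake/events/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
        "FORMAT AS JSON 'auto'"
    )
```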
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics... (Yann Cluchey)
My talk from GOTO Aarhus, 30th September 2014. Cogenta is a retail intelligence company which tracks ecommerce web sites around the world to provide competitive monitoring and analysis services to retailers. Using its proprietary crawler technology, Lucene and SQL Server, a stream of 20 million raw product data entries is captured and processed each day. This case study looks at how Cogenta uses Elasticsearch to break the shackles imposed by the RDBMS (and a limited budget) to make the data available in real time to its customers.
Cogenta uses SQL Server as its canonical store and for complex reporting, and Elasticsearch for real-time processing and to drive its SaaS web applications. Elasticsearch is easy to use, delivers the powerful features of Lucene, and lets data and platform cost scale linearly. But synchronising your existing data in two places presents some interesting challenges, such as aggregation and concurrency control. This talk takes a detailed look at how Cogenta overcame those challenges with a perpetually changing and asynchronously updated dataset.
http://gotocon.com/aarhus-2014/presentation/Cogenta%20-%20Making%20Enterprise%20Data%20Available%20in%20Real%20Time%20with%20Elasticsearch
Fluentd is an open source data collector that allows for flexible and extensible logging. It provides a unified way to collect logs, metrics, and events from various sources and send them to multiple destinations. It handles concerns like buffering, retries, and failover to provide reliable data transfer. Fluentd uses a plugin-based architecture so it can support many use cases like simple forwarding, lambda architectures, stream processing, and logging for Docker and Kubernetes.
Real Time Data Analytics with MongoDB and Fluentd at Wish (MongoDB)
Wish uses Fluentd and MongoDB for analytics. Fluentd is used to centrally collect and aggregate logs from application servers. The aggregated logs are then stored in MongoDB for fast querying and analysis. Hadoop and Hive are also used for log analysis but running a Hadoop cluster can be difficult, so analysis results are stored in MongoDB for quick access. Tools like Dashy and Perimeter are used to visualize analytics data and report on A/B tests. The analytics platform aims to enable faster experimentation and growth for Wish.
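The pattern described here, batch results computed elsewhere and stored in MongoDB for fast dashboard reads, can be sketched in a few lines of pymongo; the collection and metric names are invented for the example.

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["wish_analytics"]

# Upsert a nightly Hive/Hadoop result so dashboards never wait on the cluster.
db.daily_metrics.replace_one(
    {"date": "2016-01-01", "metric": "signups"},
    {"date": "2016-01-01", "metric": "signups", "value": 12345},
    upsert=True,
)

# Dashboard read path: last 7 days of a metric, newest first, answered by MongoDB alone.
last_week = list(
    db.daily_metrics.find({"metric": "signups"}).sort("date", -1).limit(7)
)
```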
Christian Gügi presented on building scalable big data pipelines. He discussed opportunities and challenges with big data, integrating Hadoop into existing systems, and the Lambda architecture for batch and real-time processing. He provided an example pipeline for online marketing analytics and recommendations for adopting the Lambda architecture approach.
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
This document discusses Azure big data capabilities including the 5 V's of big data: volume, velocity, variety, veracity, and value. It notes that 60% of big data projects fail to move beyond pilot according to Gartner. It then provides details on Azure persistence choices for storing big data including storage, Data Lake, HDInsight, DocumentDB, SQL databases, and Hadoop options. It also discusses load and data cleaning choices on Azure like Stream Analytics, SQL Server, and Azure Machine Learning. Finally, it presents 5 architectural patterns for using Azure big data capabilities.
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery (Chris Schalk)
This document introduces several new Google cloud technologies: Google Storage for storing data in Google's cloud, the Prediction API for machine learning and predictive analytics, and BigQuery for interactive analysis of large datasets. It provides overviews and examples of using each service, highlighting their capabilities for scalable data storage, predictive modeling, and fast querying of massive amounts of data.
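For the BigQuery piece specifically, an interactive query today is a few lines with the official Python client; the project name below is a placeholder, and the sample table is one of BigQuery's public datasets.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Run an interactive SQL query; BigQuery scans its columnar storage in parallel.
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.word, row.total)
```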
A Day in the Life of a Druid Implementor and Druid's Roadmap (Itai Yaffe)
This document summarizes a typical day for a Druid architect. It describes common tasks like evaluating production clusters, analyzing data and queries, and recommending optimizations. The architect asks stakeholders questions to understand usage and helps evaluate if Druid is a good fit. When advising on Druid, the architect considers factors like data sources, query types, and technology stacks. The document also provides tips on configuring clusters for performance and controlling segment size.
Learn what you need to consider when moving from the world of relational databases to a NoSQL document store.
Hear from Developer Advocate Glynn Bird as he explains the key differences between relational databases and JSON document stores like Cloudant, as well as how to dodge the pitfalls of migrating from a relational database to NoSQL.
An in-depth look at Google BigQuery Architecture by Felipe Hoffa of Google (Data Con LA)
Abstract: Come learn about Google BigQuery and its underlying architecture. Felipe will go over the evolution of BigQuery, explain some of the underlying principles of BigQuery and Dremel, cover some of the latest use cases, and demo a use case of Google BigQuery.
Bio:
Felipe Hoffa moved from Chile to San Francisco to join Google as a Software Engineer. Since 2013 he's been a Developer Advocate on big data, inspiring developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several YouTube videos, blog posts, and conferences around the world.
Follow Felipe at https://twitter.com/felipehoffa.
This document summarizes an open source scalable log analytics solution. The solution uses Lumberjack for log collection, Logstash for indexing and filtering logs, Redis for buffering, Elasticsearch for indexing and searching logs, MongoDB for document storage, and Kibana with D3.js for visualization. Logs are collected from servers by Lumberjack, sent to Logstash for processing, indexed by Elasticsearch for searching, stored in MongoDB for retrieval, and visualized through dashboards and reports in Kibana. The solution allows for real-time log analysis, flexible searching and filtering, and scales horizontally as needs grow.
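On the search side of such a pipeline, queries against the indexed logs look roughly like the sketch below, written against a 7.x-era elasticsearch-py client; the index pattern and field names are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Find recent error-level log lines, newest first.
resp = es.search(
    index="logs-*",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"level": "ERROR"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 20,
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["message"])
```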
Data-Driven Development Era and Its Technologies (SATOSHI TAGOMORI)
This document discusses data-driven development and the technologies used in the data analytics process. It covers topics like data collection, storage, processing, and visualization. The document advocates using managed cloud services for data and analytics to focus on data instead of managing infrastructure. Choosing technologies should be based on the type of data and problems to solve, not the other way around. Services like Google BigQuery, Amazon Redshift, and Treasure Data are recommended for their ease of use.
BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.
HUG France Feb 2016 - Migration of structured data between Hadoop and RDBMS ... (Modern Data Stack France)
Migration of structured data between Hadoop and RDBMS, by Louis Rabiet (Squid Solution)
By extracting data stored in a relational database with an advanced BI tool, and sending it via Kafka to Tachyon, several Spark sessions can work on the same dataset while limiting duplication. This provides controlled-cost communication between the source database and Spark, which makes it possible to dynamically reintroduce modified data with MLlib while still working on up-to-date data. Preliminary results will be shared during this presentation.
Options for Data Prep - A Survey of the Current Market (Dremio Corporation)
Data comes in many shapes and sizes, and every company struggles to find ways to transform, validate, and enrich data for multiple purposes. The problem has been around as long as data, and the market has an overwhelming number of options. In this presentation we look at the problem and key options from vendors in the market today. Dremio is a new approach that eliminates the need for stand alone data prep tools.
This document discusses using real-time data with Superset by ingesting data streams using Apache Flink and storing the data in Apache Druid. It outlines use cases like operational insights and service debugging. It proposes using Flink for its scalable streaming capabilities and SQL interface and integrating with Druid for its columnar storage and geospatial support. It also describes transforming the data through UDFs before querying and visualizing it in Superset.
Presto is an open source distributed SQL query engine that allows querying of data across different data sources. It was originally developed by Facebook and is now used by many companies. Presto uses connectors to query various data sources like HDFS, S3, Cassandra, MySQL, etc. through a single SQL interface. Companies like Facebook and Teradata use Presto in production environments to query large datasets across different data platforms.
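The "one SQL interface over many sources" idea can be seen in a short sketch using the presto-python-client package; the host, catalogs, and table names are placeholders.

```python
import prestodb

# Connect to a Presto coordinator; the catalog/schema only set defaults,
# fully qualified names can still reach other connectors.
conn = prestodb.dbapi.connect(
    host="presto.example.com", port=8080, user="analyst",
    catalog="hive", schema="web",
)
cur = conn.cursor()

# Join HDFS-backed logs (hive catalog) with users stored in MySQL, in one query.
cur.execute(
    "SELECT u.country, COUNT(*) AS pageviews "
    "FROM hive.web.logs l JOIN mysql.crm.users u ON l.user_id = u.id "
    "GROUP BY u.country ORDER BY pageviews DESC LIMIT 10"
)
print(cur.fetchall())
```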
Building a system for machine and event-oriented data with Rocana (Treasure Data, Inc.)
In this session, we’ll follow the flow of data through an end-to-end system built to handle tens of terabytes an hour of event-oriented data, providing real-time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive can be stitched together to form the base platform; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. Finally, a brief demo of Rocana Ops, an application for large scale data center operations, will be given, along with an explanation about how it uses the underlying platform.
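The ingest edge of a system like this is typically just well-formed events on a Kafka topic; the snippet below shows that shape with kafka-python, with the topic and event fields invented for illustration.

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One event-oriented record: who, where, when, what. Downstream consumers
# (search indexers, SQL engines, aggregators) all read from the same topic.
producer.send("events", {
    "ts": int(time.time() * 1000),
    "host": "web-01",
    "service": "nginx",
    "severity": "INFO",
    "message": "GET /index.html 200",
})
producer.flush()
```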
This document provides an overview of the role of a support engineer at TreasureData. It discusses the tools and services used to provide support, including Desk.com, Olark, Jira, and Slack. It describes how support engineers help customers by answering questions, improving queries, and investigating logs. Support engineers also aim to improve the product by sharing customer feedback. Challenges mentioned include streamlining internal support processes, migrating to a new support system, building a customer database, and establishing support key performance indicators.
How to make your open source project MATTER
Let’s face it: most open source projects die. “For every Rails, Docker and React, there are thousands of projects that never take off. They die in the lonely corners of GitHub, only to be discovered by bots scanning for SSH private keys.
Over the last 5 years, I worked on and off on marketing a piece of infrastructure middleware called Fluentd. We tried many things to ensure that it did not die: from speaking at events, talking to strangers, and giving away stickers, to getting people to install Fluentd on their laptops. Almost everything I tried had a small, incremental effect, but there were several initiatives/hacks that raised Fluentd’s awareness to the next level. As I listed these “ideas that worked”, I noticed the common thread: they all brought Fluentd into a new ecosystem via packaging.”
This document provides an introduction and overview of Hivemall, an open source machine learning library built as a collection of Hive UDFs. It begins with background on the presenter, Makoto Yui, and then covers the following key points:
- What Hivemall is and its vision of bringing machine learning capabilities to SQL users
- Popular algorithms supported in current and upcoming versions, such as random forest, factorization machines, gradient boosted trees
- Real-world use cases at companies such as for click-through rate prediction, user profiling, and churn detection
- How to use algorithms like random forest, matrix factorization, and factorization machines from Hive queries
- The development roadmap, with plans to support NLP
This presentation describes the common issues encountered when doing application logging and introduces how to solve most of them through the implementation of a unified logging layer with Fluentd.
* Event: presentation from the 'Datayanolja' (데이터야 놀자) one-day conference held at MARU180 on October 14, 2016
* Speaker: Dylan Ko (고영혁), Data Scientist / Data Architect at Treasure Data
* Contents
- Introduction to data scientist Dylan Ko
- Introduction to Treasure Data
- Global case study #1: making money with data
>> MUJI: from traditional retail to data-driven O2O
- Global case study #2: making money with data
>> Wish: shopping optimization through personalization & automation
- Global case study #3: making money with data
>> Oisix: predicting & preventing customer churn with machine learning
- Global case study #4: making money with data
>> Warner Bros.: saving time and money through process automation
- Global case study #5: making money with data
>> Ad-tech companies such as Dentsu
- What you absolutely must check when you want to make money with data
Keynote on Fluentd Meetup Summer
Related slides:
- Fluentd ServerEngine Integration & Windows Support http://www.slideshare.net/RittaNarita/fluentd-meetup-2016-serverengine-integration-windows-support
- Fluentd v0.14 Plugin API Details http://www.slideshare.net/tagomoris/fluentd-v014-plugin-api-details
- MongoDB is well-suited for systems of engagement that have demanding real-time requirements, diverse and mixed data sets, massive concurrency, global deployment, and no downtime tolerance.
- It performs well for workloads with mixed reads, writes, and updates and scales horizontally on demand. However, it is less suited for analytical workloads, data warehousing, business intelligence, or transaction processing workloads.
- MongoDB shines for use cases involving single views of data, mobile and geospatial applications, real-time analytics, catalogs, personalization, content management, and log aggregation. It is less optimal for workloads requiring joins, full collection scans, high-latency writes, or five nines uptime.
MongoDB is a document-oriented database and a very flexible one, offering horizontal scalability.
This presentation gives a basic study of MongoDB, covering installation steps and basic commands.
MongoDB and NoSQL use cases address trends of growing and increasingly complex data, cloud computing, and fast application development. MongoDB provides horizontal scaling, the ability to store complex data without pain, compatibility with object-oriented languages, frequent releases, high single-server performance, and cloud friendliness. However, it offers no complex transactions. Suitable use cases include high data volumes, complex data models, real-time analytics, agile development, and cloud deployment. Examples of users are given for content management, operational intelligence, metadata management, high-volume data feeds, marketing personalization, and dictionary services.
When it comes time to select database software for your project, there are a bewildering number of choices. How do you know if your project is a good fit for a relational database, or whether one of the many NoSQL options is a better choice?
In this webinar you will learn when to use MongoDB and how to evaluate if MongoDB is a fit for your project. You will see how MongoDB's flexible document model is solving business problems in ways that were not previously possible, and how MongoDB's built-in features allow running at scale.
Topics covered include:
Performance and Scalability
MongoDB's Data Model
Popular MongoDB Use Cases
Customer Stories
Why Organizations are Looking at Alternative Database Technologies – Introduc... (DATAVERSITY)
This webinar will first walk through the main forces driving developers and IT organizations to adopt non-relational or NoSQL databases. Next it will cover the key concepts and terminology used in the NoSQL space. Finally, using MongoDB as an example, the webinar will highlight examples of how organizations have put NoSQL technology to use in order to drive their business objectives.
The document discusses using MongoDB as a tick store for financial data. It provides an overview of MongoDB and its benefits for handling tick data, including its flexible data model, rich querying capabilities, native aggregation framework, ability to do pre-aggregation for continuous data snapshots, language drivers and Hadoop connector. It also presents a case study of AHL, a quantitative hedge fund, using MongoDB and Python as their market data platform to easily onboard large volumes of financial data in different formats and provide low-latency access for backtesting and research applications.
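The pre-aggregation pattern mentioned here usually means one document per symbol per time bucket, appended to as ticks arrive; a minimal pymongo sketch of the idea follows (field and collection names are assumptions).

```python
from pymongo import MongoClient

ticks = MongoClient("mongodb://localhost:27017")["marketdata"]["ticks_minute"]

# Append a tick to its per-symbol, per-minute bucket document (created on first tick).
ticks.update_one(
    {"symbol": "EURUSD", "minute": "2016-03-01T09:30"},
    {
        "$push": {"ticks": {"t": "09:30:14.250", "bid": 1.0863, "ask": 1.0865}},
        "$inc": {"count": 1},
    },
    upsert=True,
)

# Backtest read path: all minute buckets for a symbol over a session, in order.
session = ticks.find(
    {"symbol": "EURUSD", "minute": {"$regex": "^2016-03-01"}}
).sort("minute", 1)
```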
Which database should I use for my app?
SQL vs NoSQL databases.
What is Polyglot Persistence?
What are different types of databases out there?
Introduction to CloudBoost: http://www.cloudboost.io
Building your first app with CloudBoost.io
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle... (Rittman Analytics)
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves, and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform, and why IaaS and data-warehousing-as-a-service will have such a big impact, sooner than you think.
Getting Started with Big Data in the Cloud (RightScale)
Find out what others are doing with big data on the cloud and how to get started. We will cover solutions for Hadoop and NoSQL highlighting RightScale partner technologies IBM BigInsights, Couchbase, and MongoDB.
When to Use MongoDB...and When You Should Not... (MongoDB)
MongoDB is well-suited for applications that require:
- A flexible data model to handle diverse and changing data sets
- Strong performance on mixed workloads involving reads, writes, and updates
- Horizontal scalability to grow with increasing user needs and data volume
Some common use cases that leverage MongoDB's strengths include mobile apps, real-time analytics, content management, and IoT applications involving sensor data. However, MongoDB is less suited for tasks requiring full collection scans under load, high write availability, or joins across collections.
Adoption of MongoDB has accelerated tremendously among developers in the past 18 months, and many large enterprises have now deployed MongoDB in reliable and large scale production environments. However, for many developers, it remains a challenge to convince production teams and business stakeholders to adopt an open source technology that has not been certified yet by their IT teams. This session will provide you with the compelling arguments to reassure business and production teams such as:
Public customer references and real-world case studies (migration, and adoption stories)
Deployment support and practices for robustness
How MongoDB contributes to your company’s business value
MongoDB Versatility: Scaling the MapMyFitness Platform (MongoDB)
Chris Merz, Manager of Operations, MapMyFitness
The MMF user base more than doubled in 2011, beginning an era of rapid data growth. With Big Data come Big Data Headaches. The traditional MySQL solution for our suite of web applications had hit its ceiling. MongoDB was chosen as the candidate for exploration into NoSQL implementations, and now serves as our go-to data store for rapid application deployment. This talk will detail several of the MongoDB use cases at MMF, from serving 2TB+ of geolocation data, to time-series data for live tracking, to user sessions, app logging, and beyond. Topics will include migration patterns, indexing practices, backend storage choices, and application access patterns, monitoring, and more.
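For the geolocation workload specifically, MongoDB's geospatial indexes do the heavy lifting; a small pymongo sketch of the idea (collection and field names are made up):

```python
from pymongo import MongoClient, GEOSPHERE

points = MongoClient("mongodb://localhost:27017")["mmf"]["route_points"]
points.create_index([("loc", GEOSPHERE)])

# Store one GPS point from a workout as GeoJSON.
points.insert_one({
    "user_id": 42,
    "workout_id": 1001,
    "loc": {"type": "Point", "coordinates": [-122.4194, 37.7749]},  # lon, lat
})

# Find points within 500 m of a location, e.g. to suggest nearby routes.
nearby = points.find({
    "loc": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-122.4194, 37.7749]},
            "$maxDistance": 500,
        }
    }
})
```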
Recording: https://www.youtube.com/watch?v=qHkXVY2LpwU
External links: https://gist.github.com/itamarhaber/dddc3d4d9c19317b1477
Applications today are required to process massive amounts of data and return responses in real time. Simply storing Big Data is no longer enough; insights must be gleaned and decisions made as soon as data rushes in. In-memory databases like Redis provide the blazing fast speeds required for sub-second application response times. Using a combination of in-memory Redis and disk-based MongoDB can significantly reduce the “digestive” challenge associated with processing high velocity data.
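A common shape for that combination is Redis as a hot read-through cache in front of MongoDB; the sketch below, using redis-py and pymongo with invented key and collection names, illustrates the idea.

```python
import json

import redis
from pymongo import MongoClient

cache = redis.StrictRedis(host="localhost", port=6379)
profiles = MongoClient("mongodb://localhost:27017")["app"]["profiles"]


def get_profile(user_id: int):
    """Serve hot reads from Redis; fall back to MongoDB and cache for 60 seconds."""
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached:
        return json.loads(cached)
    doc = profiles.find_one({"_id": user_id}, {"_id": 0})
    if doc is not None:
        cache.setex(key, 60, json.dumps(doc, default=str))
    return doc
```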
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
MongoDB Breakfast Milan - Mainframe Offloading Strategies (MongoDB)
The document summarizes a MongoDB event focused on modernizing mainframe applications. The event agenda includes presentations on moving from mainframes to operational data stores, demo of a mainframe offloading solution from Quantyca, and stories of mainframe modernization. Benefits of using MongoDB for mainframe modernization include 5-10x developer productivity and 80% reduction in mainframe costs.
The document discusses the rapid growth of data on the web and how NoSQL databases provide an alternative to traditional relational databases by being able to handle massive amounts of unstructured and semi-structured data across a large number of servers in a simple and scalable way. It reviews different types of NoSQL databases like key-value stores, document databases, and graph databases and provides examples of popular NoSQL databases like MongoDB, CouchDB, HBase, and Neo4j that are being used by large companies to store and query large datasets.
The document discusses NoSQL databases and their advantages compared to SQL databases. It defines NoSQL as any database that is not relational and describes the main categories of NoSQL databases - key-value stores, document databases, wide column stores like BigTable, and graph databases. It also covers common use cases for different NoSQL databases and examples of companies using NoSQL technologies like MongoDB, Cassandra, and HBase.
MongoDB is a document-oriented, high performance, highly available, and horizontally scalable operational database. It addresses challenges with traditional RDBMS like handling high volumes of data, semi-structured and unstructured data types, and the need for agile development. MongoDB can be used for financial services use cases like high volume data feeds, risk analytics, product catalogs, trade capture, reporting, reference data management, portfolio management, quantitative analysis, and automated trading. It provides features like flexible schemas, indexing, aggregation, scaling out through sharding, and integration with Hadoop.
Webinar: How Banks Manage Reference Data with MongoDB (MongoDB)
1. MongoDB is well-suited for reference data solutions due to its dynamic and flexible schema, built-in replication and high availability features, and tag aware sharding which allows for geographic distribution of data.
2. A case study of a global broker dealer showed how MongoDB could replace expensive and complex ETL processes for distributing reference data, saving over $40 million over 5 years.
3. Key benefits included real-time data distribution, faster querying of local data, and avoiding regulatory penalties from delays in data distribution.
Similar to Augmenting Mongo DB with Treasure Data
The new GDPR regulation went into effect on May 25th. While most conversations have revolved around the security and IT aspects of the law, marketing teams will play a crucial role in helping organizations meet GDPR standards and in shaping strategy across the organization. Join us to learn more, engage with your peers and get prepared.
This webinar will cover:
- How complying with the GDPR will drive better marketing and raise the standard of the quality of your customer engagement
- The GDPR elements marketers must know about
- The elements of PII that will be affected and what marketers need to do about it
- A deep dive on how GDPR regulations will affect your marketing channels - email, programmatic advertising, cold calls, etc.
- Tactical marketing updates needed to meet GDPR guidelines
AR and VR by the Numbers: A Data First Approach to the Technology and Market (Treasure Data, Inc.)
The document discusses trends in the augmented reality (AR) and virtual reality (VR) markets. It notes that the combined AR and VR market is estimated to reach $120 billion by 2020, with AR's market estimated at $89.9 billion and VR's at $29.9 billion. While VR growth is clear, the exact size is unclear. The document outlines challenges like the need for improved headsets and continued developer investment outside of mobile. It emphasizes that AR currently focuses on using data to project context and enable interaction with the real world, and that collecting user data is important for defining the experience.
An overview of Customer Data Platforms (CDP) with the industry leader who coined the term, David Raab. Find out how to use Live Customer Data to create a better customer experience and how Live Data Management can give you a competitive edge with a 360 degree view of your clients.
Learn:
- The definition and requirements for Customer Data Platforms
- The differences between Customer Data Platforms and comparative technologies such as Data Warehousing and Marketing Automation
- Reference architectures/approaches to building CDP
- How Treasure Data is used to build Customer Data Platforms
And here's the song: https://youtu.be/RalMozVq55A
In this hands-on webinar we will cover how to leverage the Treasure Data Javascript SDK library to ensure user stitching of web data into the Treasure Data Customer Data Platform to provide a holistic view of prospects and customers.
We will demo the native SDK, as well as deploying the SDK inside of Adobe DTM and Google Tag Manager.
Hands-On: Managing Slowly Changing Dimensions Using TD Workflow (Treasure Data, Inc.)
In this hands-on webinar we'll explore the data warehousing concept of Slowly Changing Dimensions (SCDs) and common use cases for managing SCDs when dealing with customer data. This webinar will demonstrate different methods for tracking SCDs in a data warehouse, and how Treasure Data Workflow can be used to create robust data pipelines to handle these processes.
Brand Analytics Management: Measuring CLV Across Platforms, Devices and Apps (Treasure Data, Inc.)
Gaming companies with multiple products often struggle to calculate accurate Customer Lifetime Value (CLTV) across their portfolio. This is because user data is often analyzed in silos so companies are unable to get a clear picture of ROI and CLTV across platforms, devices and apps.
In this webinar we’ll look at how you can apply a holistic and complete approach to your CLTV and ROI through the lens of gaming companies, though this technique is applicable for any company who has products spanning platforms.
We’ll also explore:
How the integral power of data in business has shifted over the past 10 years.
Discover the current technologies and processes used to analyze data across different platforms by combining multiple data streams, looking at examples in brand and portfolio-based LTV.
How to process and centralize dozens of varying data streams.
Nicolas Nadeau will speak from his extensive experience and show how leveraging data from multiple product strategies spanning many platforms can be highly beneficial for your company.
Do you know what your top ten 'happy' customers look like? Would you like to find ten more just like them? Come learn how to leverage 1st & 3rd party data to map your customer journey and drive users down a path where every interaction is personalized, fun, & data-driven. No more detractors, power your Customer Experience with data!
In this webinar you will learn:
-When, why, and how to leverage 1st, 2nd, and 3rd party data
-Tips & Tricks for marketers to become more data driven when launching their campaigns
-Why all marketers needs a 360 degree customer view
The reality is virtual, but successful VR games still require cold, hard data. For wildly popular games like Survios’ Raw Data, the first VR-exclusive game to reach #1 on Steam’s Global Top Sellers list, data and analytics are the key to success.
And now online gaming companies have the full-stack analytics infrastructure and tools to measure every aspect of a virtual reality game and its ecosystem in real time. You can keep tabs on lag, which ruins a VR experience, improve gameplay and identify issues before they become showstoppers, and create fully personalized, completely immersive experiences that blow minds and boost adoption, and more. All with the right tools.
Make success a reality: Register now for our latest interactive VB Live event, where we’ll tap top experts in the industry to share insights into turning data into winning VR games.
Attendees will:
* Understand the role of VR in online gaming
* Find out how VR company Survios successfully leverages the Exostatic analytics infrastructure for commercial and gaming success
* Discover how to deploy full-stack analytics infrastructure and tools
Speakers:
Nicolas Nadeau, President, Exostatic
Kiyoto Tamura, VP Marketing, Treasure Data
Ben Solganik, Producer, Survios
Stewart Rogers, Director of Marketing Technology, VentureBeat
Wendy Schuchart, Moderator, VentureBeat
The document discusses how marketers can better leverage customer data to improve the customer experience. It provides tips from various experts on developing a robust data strategy, asking the right questions of data to uncover insights, owning customer data to stay compliant with regulations, and how IoT can be used to inform and deploy customer experience solutions. The overall message is that marketers need to stop data from being fragmented and better connect customer touchpoints to deliver personalized experiences.
Harnessing Data for Better Customer Experience and Company Success (Treasure Data, Inc.)
As big data has exploded, the ability for companies to easily leverage it has imploded. Organizations are drowning in their own information, unable to see the forest for the trees, while the big players consistently outperform in their ability to deliver a great customer experience, faster and cheaper. As a result, the vast majority of companies are scrambling to catch up and become more agile and data-driven, to use their data more effectively so they can attract and retain their elusive customers.
In this joint deck by 451 Research and Treasure Data, you will learn how to enable your line of business team to own their own data (instead of relying on IT) to be able to:
- deliver a single, persistent view of your customer based on behavior data
- make that data accessible to the right people at the right time
- Increase organizational effectiveness by (finally) breaking down silos with data
- enable powerful marketing tools to enhance the customer experience
This document summarizes Johan Gustavsson's presentation on scalable Hadoop in the cloud. It discusses (1) replacing an on-premise Hadoop cluster with Plazma storage on S3 and job execution in containers, (2) how jobs are isolated either through individual JobClients or resource pools, and (3) ongoing architecture changes through the Patchset Treasure Data initiative to support multiple Hadoop versions and improve high availability of job submission services.
Treasure Data: Move your data from MySQL to Redshift with (not much more tha... (Treasure Data, Inc.)
This document discusses migrating data from MySQL to Amazon Redshift. It describes MySQL and Redshift, and some of the challenges of migrating between the two systems, such as incompatible schemas and manual processes. The proposed solution is to use a cloud data lake with schema-on-read to store JSON event data, which can then be loaded into Redshift, a cloud data warehouse with schema-on-write, providing an automated way to migrate data between different systems and schemas.
Pebble uses data science and analytics to improve its smartwatch products. Pebble's data team analyzes over 60 million records per day from the watches to measure user engagement, identify issues, and inform new product design. Their first problem was setting an engagement threshold using the accelerometer. Rapid testing of different thresholds against "backlight data" validated the optimal threshold. Pebble has since solved many problems using their analytics infrastructure at Treasure Data to query, explore, and gain insights from massive user data in real-time.
This document discusses a tech talk given by Makoto Yui at Treasure Data on May 14, 2015. It includes an introduction to Hivemall, an open source machine learning library built on Apache Hive. The talk covers how to use Hivemall for tasks like data preparation, feature engineering, model training, and prediction. It also discusses doing real-time prediction by training models offline on Hadoop and performing online predictions using the models on a relational database management system.
Essentials of Automations: Exploring Attributes & Automation Parameters (Safe Software)
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Generating privacy-protected synthetic data using Secludy and Milvus (Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an... (Jason Yip)
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the CCB and CCX licensing model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help you with it!
We will explain how to fix common configuration problems that can lead to more users being counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some approaches that can lead to unnecessary costs, for example using a person document instead of a mail-in database for shared mailboxes. We will show you such cases and their solutions. And of course we will explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It will give you the tools and know-how to keep track of what is going on. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, a complimentary SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
3. About Me
• A recovering software engineer turned digital artist, once interested in fractals;
• now into data visualization based on large datasets rendered directly to the GPU (RGL, various Python GL libraries, etc.)
• it’s easier these days to manipulate large datasets with limited effort
4. Images courtesy of Edureka, 10gen, MongoDB, clipart panda and aperfectworld.org
9. “not so strengths” of MongoDB
• Horowitz was also very honest about where and how MongoDB is lacking in its current offering – most notably in terms of integration capabilities and some areas of high performance.
• “In the relational world you’ve got a few big boxes, in the MongoDB world you could have 2,000 commodity servers, so you need really great management tools for that. That’s a huge problem for us.”
• “The other big thing is automation, where you can have automation tools that let you manage very large clusters all from a very simple pane of glass.”
http://diginomica.com/2014/11/10/mongodb-cto-mongo-works-doesnt/
From an interview with MongoDB CTO Eliot Horowitz
11. hmmm…moar “not so strengths” ;) of MongoDB
• The dreaded “Write Lock”: https://news.ycombinator.com/item?id=1691748
• http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/ - is the data actually relational or not?
• Slideshare: “where not to use MongoDB”
• Slideshare: “Hive vs. Cassandra vs. MongoDB”
• http://www.slideshare.net/johnrjenson/mongodb-pros-and-cons
• “choosing the right NoSQL database” video: https://www.youtube.com/watch?t=34&v=gJFG04Sy6NY
• limits of MongoDB: http://docs.mongodb.org/manual/reference/limits/
24. Complementing MongoDB
• Operationally?
• Managing MongoDB is hard (spin up instances with Mongolab)
• MongoDB monitoring products: Ops Manager and MMS
• What’s your pain? What are you missing?
29. DATA ACCESS > ADVANCED ANALYTICS
• Product analytics: data access is a major issue.
• “Machine learning” is still simple and “small” in scale (can be done inside Python)
• Future work: productized/operationalized machine learning
[Diagram labels: bluetooth; iOS/Android SDK; Fluentd; Python/Pandas SDK; Data Science Team (5 people)]
30. SCHEMA(LESS) COUNTS
• Redshift: lots of co-use cases
• Event data is semi-structured → can be modeled as JSON, but schemas change
• Treasure Data provides a SQL-accessible, semi-structured data lake.
[Diagram labels: email; Source of truth/JSON; More intensive data processing; Hourly/Daily load; Big data mart; More interactive data processing; Ad hoc queries for new data; Ad hoc queries for cached data]
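To make the schema-drift point concrete, here is a minimal sketch; the event records, field names, and the use of pandas.json_normalize are illustrative assumptions, not the actual pipeline. Newer events add fields that older events lack, and schema-on-read simply surfaces the union of columns.

```python
import pandas as pd

# Hypothetical semi-structured events: the schema drifts as new fields appear.
events = [
    {"event": "signup", "user_id": 1, "ts": "2015-06-01T10:00:00Z"},
    {"event": "purchase", "user_id": 2, "ts": "2015-06-01T10:05:00Z",
     "amount": 9.99, "device": {"os": "ios", "version": "8.3"}},
]

# Schema-on-read: flatten whatever fields are present; records that predate a
# field simply get NaN for it, so old and new events share one queryable table.
df = pd.json_normalize(events)
print(df.columns.tolist())
print(df)
```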
31. DATA COLLECTION IS HARD
• Want to assume all data is on S3 or HDFS, but reality is murkier.
• Sensor readings available as email attachments
• Provide data collection tools for 90% of the use cases. Have APIs ready for 10%.
[Diagram labels: GH SCADA; email; Parse & transform; Import via REST; Import data as JSON; Analyze via SQL; Query; Results; Data-informed maintenance]
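As a rough illustration of the “parse & transform, then import via REST” path for sensor readings that arrive as email attachments, here is a sketch in Python; the endpoint URL, API key, and CSV column names are hypothetical placeholders, not an actual Treasure Data API.

```python
import csv
import io
import json
import requests

INGEST_URL = "https://ingest.example.com/v1/sensor_readings"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                                       # hypothetical credential

def import_attachment(csv_bytes: bytes) -> None:
    """Parse a CSV sensor attachment and import the readings as JSON via REST."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    records = [
        {"device_id": row["device_id"], "ts": row["timestamp"], "value": float(row["value"])}
        for row in reader
    ]
    resp = requests.post(
        INGEST_URL,
        headers={"Authorization": "Bearer " + API_KEY, "Content-Type": "application/json"},
        data=json.dumps(records),
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    sample = b"device_id,timestamp,value\npump-7,2015-06-01T10:00:00Z,42.5\n"
    import_attachment(sample)
```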
33. Some revised scenarios
• Revised scenario 1: Using Treasure Data for ingestion and analytics; exporting results to MongoDB for reporting
34. Some revised scenarios
• Revised scenario 2: Ingesting data into MongoDB and exporting to Treasure Data
35. Treasure Data is good for some of the same things…
• less overhead in setup
• less - make that practically no - effort to scale
• less overhead/effort to use
• but -> less fine-tuned control over the outcome
Just a quick review…
Sharding strategies: Range sharding - the shard key (e.g. device id) is split into contiguous ranges
Hash sharding: MongoDB applies an MD5 hash to the shard key (or subkey) value to distribute documents evenly
Tag-aware sharding: allows a subset of shards to be tagged and assigned to a sub-range of the shard key
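For reference, the first two strategies map onto ordinary shardCollection commands. The pymongo sketch below assumes a running sharded cluster reachable through a mongos on localhost and uses hypothetical names (iot.readings, device_id); it is an illustration, not the deck's own setup.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connect through mongos

client.admin.command("enableSharding", "iot")

# Range sharding: documents are partitioned into contiguous ranges of device_id.
client.admin.command("shardCollection", "iot.readings", key={"device_id": 1})

# Hashed sharding (alternative for a new collection): MongoDB hashes device_id
# so inserts spread evenly across shards.
# client.admin.command("shardCollection", "iot.readings", key={"device_id": "hashed"})

# Tag-aware (zone) sharding layers on top of this via the mongo shell helpers
# sh.addShardTag(...) and sh.addTagRange(...), configured separately.
```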
The folks at Edureka did a comparative study of different database types.
Breakoff discussion: Let’s talk about what kind of databases we’re using, and for what purposes.
Question to audience: How do Mongo and HBase (Plazma?) fall on the boundary between partition tolerance and consistency? (Might consider leaving this slide out.)
BSON looks like JSON and translates nicely to things like Python dictionaries. Working at the Mongo prompt is easy but requires mastering another API/paradigm.
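As a tiny illustration of that round trip (assuming a local mongod and the hypothetical namespace demo.events), pymongo hands BSON documents back as plain Python dictionaries:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.demo.events  # hypothetical database/collection

events.insert_one({"user_id": 42, "event": "login", "tags": ["mobile", "ios"]})
doc = events.find_one({"user_id": 42})
print(type(doc), doc["event"])   # <class 'dict'> login
```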
What are some other strengths of MongoDB?
Managing MongoDB is hard (spin up instances with Mongolab)
MongoDB (monitoring products: Ops Manager and MMS)
If you can lose 5 seconds' worth of updates, a MongoDB replication pair is just fine. If you can lose a day's worth of updates (or can easily reconstruct the database contents from other sources), you can try out pretty much anything without bad repercussions. If you can't lose anything, you're pretty much limited to the most conservative databases (the SQL bunch).
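One way to see that trade-off in code: pymongo's write concern controls how much acknowledgment you wait for before considering an update durable. A minimal sketch, assuming a replica set and a hypothetical demo.events collection:

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client.demo  # hypothetical database name

# Acknowledged by the primary only: fast, but a failover can drop the last
# few seconds of updates (the "can lose 5 seconds' worth" case).
fast = db.get_collection("events", write_concern=WriteConcern(w=1))

# Majority + journal: slower, for data you cannot afford to lose.
safe = db.get_collection("events", write_concern=WriteConcern(w="majority", j=True))
```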
Any places where this process could be problematic? One is a failure before cache is written, during finalize. Another could be a failure during any step of the M-R process.
NOTE: Need transitions on this slide to control how things appear with my story
We start in the app which generates the logs. The app synchronously logs to a fluentd running on the same host. There’s no network latency and the load on each local fluentd is trivial, so we’ve never had problems with these getting slow
The local fluentd accepts the logs and buffers them on disk for reliability. Periodically, it flushes those buffers to one of the hosts in our fluentd aggregation tier with at-least-once semantics. These run active/active and can be scaled out linearly.
They also buffer on disk and periodically flush into Hadoop. As an added bonus, they also flush into S3 for backup. This tier gives us an easy to monitor & manage conduit for our logs to flow through without imposing extra costs on the app.
To recap, the logs from our app are buffered by a fluentd on the same host. That reliably forwards to a tier of aggregation fluentds, which forward to Hadoop and S3.
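For the first hop in that pipeline, here is a minimal sketch of how an app might hand records to the local fluentd using the fluent-logger Python package; the tag prefix "app" and the record fields are hypothetical, and it assumes fluentd's default forward input on localhost:24224.

```python
from fluent import sender

logger = sender.FluentSender("app", host="localhost", port=24224)

# Each emit() hands one record to the fluentd on the same host, which buffers
# it on disk and later flushes it to the aggregation tier.
if not logger.emit("access", {"user_id": 42, "path": "/checkout", "status": 200}):
    print(logger.last_error)   # inspect why the record could not be sent
    logger.clear_last_error()

logger.close()
```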
The sunk cost fallacy is the idea that your sunk (unrecoverable) costs create barriers to adjusting your future spending. For example: "I'm hungry, therefore I should eat that egg salad in the fridge (even if it's gone bad) because I've already spent the money on it," rather than going for fresh food.