This document discusses using StreamSets Data Collector (SDC) to build a logging infrastructure for microservices. SDC can ingest logs from microservices running in containers and handle issues like schema changes and new log formats. It processes and transforms the logs, sending them to destinations like Kafka. SDC pipelines can run on Spark clusters on YARN and Mesos to handle large volumes of log data and load it into systems like HDFS, HBase, and Elasticsearch for analysis.
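To make that kind of pipeline concrete, here is a minimal sketch (not taken from the deck) of a containerized microservice emitting structured JSON log lines to stdout; the field names such as `service` and `trace_id` are illustrative assumptions, and an ingest pipeline like SDC could collect such output and route the parsed records to Kafka.

```python
# Minimal sketch: a microservice emitting one JSON log record per line to stdout.
# Field names (service, level, trace_id) are illustrative assumptions; a collector
# such as StreamSets Data Collector could parse this output and forward it to Kafka.
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "service": "orders-api",          # hypothetical service name
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"trace_id": "abc123"})
```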
As GoPro expands into content networks and launches new products, new challenges have appeared. One of the most critical challenges facing GoPro during this period of rapid growth is its ability to make effective use of massive amounts of data. Every day, GoPro collects increasing amounts of data generated by internet-connected consumer devices (smart cameras, smart drones), GoPro mobile apps, GoPro content networks, GoPro e-commerce sales, and social media. This data ranges from raw camera logs to refined and well-structured e-commerce datasets. In the past, it took GoPro months to understand new inbound data and determine how to transform or augment it for analysis. To streamline this process and bridge the gap between tech-savvy engineers and data-savvy analysts, GoPro is creating an analysis loop, which informs product usage trends and product insights. This analysis loop serves a large ecosystem of GoPro executives, product managers, engineers, data scientists, and business analysts through an integrated technology pipeline consisting of Apache Kafka, Apache Spark Streaming, Cloudera’s distribution of Hadoop, and Tableau’s data visualization software as the end-user analytical tool. Session sponsored by Tableau Software.
AWS re:Invent 2016: Migrating a Highly Available and Scalable Database from O... - Amazon Web Services
In this session, we share how an Amazon.com team that owns a document management platform that manages billions of critical customer documents for Amazon.com migrated from a relational to a non-relational database. Initially, the service was built on an Oracle database. As it grew, the team discovered the limits of the relational model and decided to migrate to a non-relational database. They chose Amazon DynamoDB for its built-in resilience, scalability, and predictability. We provide a template that you can use to migrate from a relational data store to DynamoDB. We also provide details about the entire process: design patterns for moving from a SQL schema to a NoSQL schema; mechanisms used to transition from an ACID (Atomicity, Consistency, Isolation, Durability) model to an eventually consistent model; migration alternatives considered; pitfalls in common migration strategies; and how to ensure service availability and consistency during migration.
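The abstract does not include the team's actual schema, so the following is only a hedged, hypothetical boto3 sketch of the kind of SQL-to-NoSQL design pattern it describes: collapsing a relational customer/document join into a single DynamoDB table with a composite partition/sort key. The table and attribute names are assumptions, not details from the session.

```python
# Hypothetical sketch of a SQL-to-NoSQL design pattern: a relational
# customer -> documents join becomes one DynamoDB table with a composite key,
# so all documents for a customer are fetched with a single Query.
# Table and attribute names are illustrative, not from the session.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("CustomerDocuments")  # assumed table name

# Write: the partition key groups items by customer, the sort key orders documents.
table.put_item(Item={
    "customer_id": "CUST#42",
    "document_id": "DOC#2016-11-30#0001",
    "title": "Invoice",
    "status": "ACTIVE",
})

# Read: one Query replaces the relational join.
resp = table.query(KeyConditionExpression=Key("customer_id").eq("CUST#42"))
for item in resp["Items"]:
    print(item["document_id"], item["title"])
```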
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315) - Amazon Web Services
During this session Greg Brandt and Liyin Tang, Data Infrastructure engineers from Airbnb, will discuss the design and architecture of Airbnb's streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb's data warehouse, using a system called SpinalTap. We will also discuss how we leverage Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
Choosing the Right Database for the Job: Relational, Cache, or NoSQL? - Amazon Web Services
Developers and DBAs from a traditional relational background are spoilt for choice when looking to integrate caching and NoSQL into an application architecture to solve scaling problems and reduce costs. Even when using relational databases there are 3 managed database services on AWS for the MySQL engine alone. Trying to evaluate all the options often creates analysis paralysis, resulting in a reluctance to try something new or different. This session will guide you through a series of use cases that use different databases to solve business problems that customers face today.
Building analytics applications requires more than just one good service. It requires the ability to capture a vast amount of data and react to data changes in real time. It requires flexible tools that enable end users to work in the way they can be most productive, and that address the needs of both data consumers and data scientists. This analysis won't just be about data exploration and reports, but must be able to support the largest-scale, most complex machine and deep learning models imaginable. Across it all, strong governance, security, and cataloguing are essential. In this session, come hear how to build a full-stack analytics application using AWS services. We'll see how to capture static and dynamic data in real time, and react to data changes. We'll see AWS services that span analytics from drag-and-drop tools, through simple query-on-files, to exascale data science. At the end, we'll have a data lake architecture that will meet the demands of the most sophisticated analytics customers for many years to come.
AWS Speaker: Ian Robinson, Specialist Solution Architect, Big Data and Analytics, EMEA - Amazon Web Services
AWS re:Invent 2016: Fireside chat with Groupon, Intuit, and LifeLock on solvi... - Amazon Web Services
Redis Labs' CMO is hosting a fireside chat with leaders from multiple industries including Groupon (e-commerce), Intuit (finance), and LifeLock (identity protection). This conversation-style session will cover the Big Data-related challenges faced by these leading companies as they scale their applications, ensure high availability, serve the best user experience at the lowest latencies, and optimize between cloud and on-premises operations. The introductory-level session will appeal to both developer and DevOps functions. Attendees will hear about diverse use cases such as recommendation engines, hybrid transaction and analytics operations, and time-series data analysis. The audience will learn how the Redis in-memory database platform addresses the above use cases with its multi-model capability, in a cost-effective manner, to meet the needs of next-generation applications. Session sponsored by Redis Labs.
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401) - Amazon Web Services
Learn how to leverage new workflow management tools to simplify complex data pipelines and ETL jobs spanning multiple systems. In this technical deep dive from Treasure Data, the company's founder and chief architect walks through the codebase of DigDag, our recently open-sourced workflow management project. He shows how workflows can break large, error-prone SQL statements into smaller blocks that are easier to maintain and reuse. He also demonstrates how a system using ‘last good’ checkpoints can save hours of computation when restarting failed jobs and how to use standard version control systems like GitHub to automate data lifecycle management across Amazon S3, Amazon EMR, Amazon Redshift, and Amazon Aurora. Finally, you see a few examples where SQL-as-pipeline-code gives data scientists both the right level of ownership over production processes and a comfortable abstraction from the underlying execution engines. This session is sponsored by Treasure Data.
AWS Competency Partner
When the Cloud is a Rockin: High Availability in Apache CloudStack - John Burwell
CloudStack currently provides a variety of bespoke high-availability mechanisms for resources such as virtual machines, hosts, and virtual routers. Each of these implementations duplicates the HA check/recovery cycle, as well as the concurrency, persistence, and clustering required to manage high availability for any CloudStack resource. The High Availability Resource Management Service has been developed to consolidate these concerns -- providing a robust, extensible HA mechanism. Using this service, plugins only need to define health check, activity check, and fence operations.
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a... - Amazon Web Services
Building big data applications often requires integrating a broad set of technologies to store, process, and analyze the increasing variety, velocity, and volume of data being collected by many organizations.
Using a combination of Amazon EMR, a managed Hadoop framework, and Amazon Redshift, a managed petabyte-scale data warehouse, organizations can effectively address many of these requirements.
In this webinar, we will show how organizations are using Amazon EMR and Amazon Redshift to build more agile and scalable architectures for big data. We will look into how you can leverage Spark and Presto running on EMR, to address multiple data processing requirements. We will also share best practices and common use cases to integrate EMR and Redshift.
Learning Objectives:
• Best practices for building a big data architecture that includes Amazon EMR and Amazon Redshift
• Understand how to use technologies such as Amazon EMR, Presto and Spark to complement your data warehousing environment
• Learn key use cases for Amazon EMR and Amazon Redshift
Who Should Attend:
• Data architects, Data management professionals, Data warehousing professionals, BI professionals
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS... - Pat Patterson
On a typical day we see hundreds of downloads of StreamSets Data Collector, our open source data integration tool. We used to wrangle our download logs using a combination of the AWS S3 command line, sed, grep, awk and other tools, all run from a shell script (on my laptop!) once a week. This was a classic example of a brittle, hard-to-maintain, custom data integration. One day it dawned on me, "This is crazy, we have a tool that can do all this!". In this session, I'll explain how I built a dataflow pipeline to stream content delivery network (CDN) logs from S3 to MySQL in real time, allowing us to gain valuable insights into our open source community. You'll also learn how we use the same techniques to not only gain insights into our community on Slack, but also build tools to better serve them.
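The talk does not publish its pipeline configuration, but a hedged Python sketch can illustrate the kind of work the old shell script did and the pipeline now automates: list CDN log objects in S3, parse tab-separated entries, and load rows into MySQL. The bucket, table, and column names are assumptions, as is the gzipped, CloudFront-style tab-separated log layout.

```python
# Hypothetical sketch of the manual work such a pipeline replaces: pull CDN log
# files from S3, parse tab-separated entries, and insert download rows into MySQL.
# Bucket/table/column names and the log layout are assumptions, not the real setup.
import gzip

import boto3
import pymysql

s3 = boto3.client("s3")
db = pymysql.connect(host="localhost", user="analytics", password="secret",
                     database="downloads")

with db.cursor() as cur:
    for obj in s3.list_objects_v2(Bucket="sdc-cdn-logs", Prefix="2016/")["Contents"]:
        body = s3.get_object(Bucket="sdc-cdn-logs", Key=obj["Key"])["Body"].read()
        for line in gzip.decompress(body).decode().splitlines():
            if line.startswith("#"):           # skip header lines
                continue
            fields = line.split("\t")          # assumed tab-separated CDN format
            date, time_, client_ip, uri = fields[0], fields[1], fields[4], fields[7]
            cur.execute(
                "INSERT INTO downloads (log_date, log_time, client_ip, uri) "
                "VALUES (%s, %s, %s, %s)",
                (date, time_, client_ip, uri),
            )
db.commit()
```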
Amazon Web Services gives you fast access to flexible and low cost IT resources, so you can rapidly scale and build virtually any big data and analytics application including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity, and variety of data.
In this one-hour webinar, we will look at the portfolio of AWS Big Data services and how they can be used to build a modern data architecture.
We will cover:
Using different SQL engines to analyze large amounts of structured data
Analysing streaming data in near-real time
Architectures for batch processing
Best practices for Data Lake architectures
This session is suited for:
Solution and enterprise architects
Data architects/ Data warehouse owners
IT & Innovation team members
Understanding AWS Managed Database and Analytics Services | AWS Public Sector... - Amazon Web Services
The world is creating more data in more ways than ever before. The average internet user in 2017 generates 1.5GB of data per day, with the rate doubling every 18 months. A single autonomous vehicle can generate 4TB per day. Each smart manufacturing plant generates 1PB per day. Storing, managing, and analyzing this data requires integrated database and analytic services that provide reliability and security at scale. AWS offers a range of managed data services that let customers focus on making data useful, including Amazon Aurora, RDS, DynamoDB, Redshift, Spectrum, ElastiCache, Kinesis, EMR, Elasticsearch Service, and Glue. In this session, we discuss these services, share our vision for innovation, and show how our customers use these services today. Learn More: https://aws.amazon.com/government-education/
FSI301 An Architecture for Trade Capture and Regulatory Reporting - Amazon Web Services
For many securities organizations, post-trade processing is expensive, cumbersome, and time-consuming. This is in part due to the massive volumes of data required for processing a trade and the limited agility of the technology many organizations rely on today. In order to create efficiencies and move faster, many Financial Services organizations are working with AWS to implement post-trade solutions built with AWS’ storage services (S3 and Glacier) and big data capabilities (Athena, EMR, Redshift, and QuickSight). In this session, AWS will walk through a trade capture and regulatory reporting solution that utilizes the aforementioned AWS services. We will also provide guidance around obtaining data-driven insights (from pixels to pictures), bolstering encryption with Amazon KMS, and maintaining transparency and control with Amazon CloudWatch and Amazon CloudTrail (which also helps meet SEC Rule 613 that requires the creation of comprehensive consolidated audit trails).
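As a small, hedged illustration of the query side of such an architecture, the boto3 sketch below submits an Athena query over trade data stored in S3; the database, table, and bucket names are assumptions rather than details from the session.

```python
# Hedged sketch: run an Athena query over trade records stored in S3.
# The database, table, and output bucket names are assumptions.
import time

import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="""
        SELECT trade_date, symbol, COUNT(*) AS trades
        FROM trades
        WHERE trade_date = DATE '2016-11-30'
        GROUP BY trade_date, symbol
    """,
    QueryExecutionContext={"Database": "post_trade"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```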
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ... - Amazon Web Services
Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), application programming interfaces (API), clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. Building scalable big data pipelines with automated extract-transform-load (ETL) and machine learning processes can address these limitations. JustGiving is the world’s largest social platform for online giving. In this session, we describe how we created several scalable and loosely coupled event-driven ETL and ML pipelines as part of our in-house data science platform called RAVEN. You learn how to leverage AWS Lambda, Amazon S3, Amazon EMR, Amazon Kinesis, and other services to build serverless, event-driven, data and stream processing pipelines in your organization. We review common design patterns, lessons learned, and best practices, with a focus on serverless big data architectures with AWS Lambda.
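RAVEN's internals are not described in the abstract, so the sketch below is only a generic illustration of the serverless, event-driven pattern the session covers: an AWS Lambda handler triggered by an S3 object-created event that forwards a small record to a Kinesis stream. The stream name and record layout are assumptions.

```python
# Generic sketch of an event-driven ETL step (not JustGiving's actual code):
# a Lambda function triggered by an S3 "object created" event pushes a record
# onto a Kinesis stream for downstream processing. The stream name and record
# fields are assumptions.
import json

import boto3

kinesis = boto3.client("kinesis")


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        kinesis.put_record(
            StreamName="raw-events",          # assumed stream name
            Data=json.dumps({"bucket": bucket, "key": key}),
            PartitionKey=key,
        )
    return {"processed": len(event.get("Records", []))}
```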
AWS re:Invent 2016: Cloud Monitoring - Understanding, Preparing, and Troubles... - Amazon Web Services
Applications running in a typical data center are static entities. Dynamic scaling and resource allocation are the norm in AWS. Technologies such as Amazon EC2, Docker, AWS Lambda, and Auto Scaling make tracking resources and resource utilization a challenge. The days of static server monitoring are over.
In this session, we examine trends we’ve observed across thousands of customers using dynamic resource allocation and discuss why dynamic infrastructure fundamentally changes your monitoring strategy. We discuss some of the best practices we’ve learned by working with New Relic customers to build, manage, and troubleshoot applications and dynamic cloud services. Session sponsored by New Relic.
AWS Competency Partner
ENT316 Keeping Pace With The Cloud: Managing and Optimizing as You Scale - Amazon Web Services
"With cloud maturity comes operational efficiencies and endless potential for innovation and business growth. However, the complexities of governing cloud infrastructure are impeding without the right strategy. Visibility, accountability, and actionable insights are some of the most invaluable considerations. The AWS cloud clearly enables convenience and cost savings for organizations that know how to leverage its full potential. Amazon EC2 Reserved Instances (RIs) in particular, present a tremendous opportunity when scaling to save significantly on capacity but there are many considerations to fully reaping the benefits of RIs. In this session, CloudCheckr CTO Patrick Gartlan will present issues that every organization runs into when scaling, provide best practices for how to combat them and help you show your boss how RIs help you save money and move faster.
This session is brought to you by AWS Summit New York City sponsor, CloudCheckr. "
Building Continuously Curated Ingestion Pipelines - Arvind Prabhakar
Data ingestion is a critical piece of infrastructure for any Big Data project. Learn about the key challenges in building ingestion infrastructure and how enterprises are solving them using low-level frameworks like Apache Flume and Kafka, and higher-level systems such as StreamSets.
Open Source Big Data Ingestion - Without the Heartburn! - Pat Patterson
Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, NiFi and StreamSets can keep the data pipeline flowing.
Building Data Pipelines with Spark and StreamSets - Pat Patterson
Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an Apache 2.0 licensed open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, with a particular focus on clustered deployment with Spark and other exciting Spark integrations in the works.
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS... - DataStax
Cassandra is a perfect fit for consuming high volumes of time-series data directly from users, devices, and sensors. Sometimes, though, when we consume data from the real world, systematic and random errors creep in. In this session, we'll see how to use open source tools like RabbitMQ and StreamSets Data Collector with Cassandra features such as User Defined Aggregates to collect, cleanse and ingest variable quality data at scale. Discover how to combine the power of Cassandra with the flexibility of StreamSets to implement adaptive data cleansing.
About the Speaker
Pat Patterson, Community Champion, StreamSets
Pat Patterson has been working with Internet technologies since 1997, building software and working with communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for OpenSSO, while at Huawei he developed cloud storage infrastructure software. A developer evangelist at Salesforce, Pat focused on identity, integration and IoT. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.
Adaptive Data Cleansing with StreamSets and Cassandra - Pat Patterson
Presented at Cassandra Summit 2016.
Cassandra is a perfect fit for consuming high volumes of time-series data directly from users, devices, and sensors. Sometimes, though, when we consume data from the real world, systematic and random errors creep in. In this session, we'll see how to use open source tools like RabbitMQ and StreamSets Data Collector with Cassandra features such as User Defined Aggregates to collect, cleanse and ingest variable quality data at scale. Discover how to combine the power of Cassandra with the flexibility of StreamSets to implement adaptive data cleansing.
A global survey of more than 300 data management professionals conducted by independent research firm Dimensional Research® showed that enterprises of all sizes face challenges on a range of key data performance management issues from stopping bad data to keeping data flows operating effectively. In particular, 87 percent of respondents report flowing bad data into their data stores while just 12 percent consider themselves good at the key aspects of data flow performance management.
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most learnt through expensive mistakes.
In this talk, we provide an introduction to Python Luigi via real life case studies showing you how you can break large, multi-step data processing task into a graph of smaller sub-tasks that are aware of the state of their interdependencies.
Growth Intelligence tracks the performance and activity of all the companies in the UK economy using their data ‘footprint’. This involves tracking numerous unstructured data points from multiple sources in a variety of formats and transforming them into a standardised feature set we can use for building predictive models for our clients.
In the past, this data was collected in a somewhat haphazard fashion: a combination of manual effort and ad hoc scripting and processing that was difficult to maintain. In order to streamline the data flows, we’re using an open-source Python framework from Spotify called Luigi. Luigi was created for managing task dependencies, monitoring the progress of the data pipeline and providing frameworks for common batch processing tasks.
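Neither abstract includes its actual task code, so here is a minimal, self-contained Luigi sketch of the pattern both describe: small tasks declaring their dependencies, with Luigi re-running only tasks whose outputs are missing. The task and file names are placeholders, not Growth Intelligence's real pipeline.

```python
# Minimal Luigi sketch of a dependency-aware pipeline: Extract -> BuildFeatures.
# Luigi reruns only tasks whose output targets are missing, which is how a large
# job is broken into smaller, restartable sub-tasks. Names are placeholders.
import json

import luigi


class ExtractRawData(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw_companies.json")

    def run(self):
        with self.output().open("w") as f:
            json.dump([{"name": "Acme Ltd", "employees": 12}], f)


class BuildFeatures(luigi.Task):
    def requires(self):
        return ExtractRawData()

    def output(self):
        return luigi.LocalTarget("features.csv")

    def run(self):
        with self.input().open() as f:
            companies = json.load(f)
        with self.output().open("w") as f:
            f.write("name,employees\n")
            for c in companies:
                f.write(f"{c['name']},{c['employees']}\n")


if __name__ == "__main__":
    luigi.build([BuildFeatures()], local_scheduler=True)
```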
UX, ethnography and possibilities: for Libraries, Museums and Archives - Ned Potter
These slides are adapted from a talk I gave at the Welsh Government's Marketing Awards for the LAM sector, in 2017.
It offers a primer on UX - User Experience - and how ethnography and design might be used in the library, archive and museum worlds to better understand our users. All good marketing starts with audience insight.
The presentation covers the following:
1) An introduction to UX
2) Ethnography, with definitions and examples of 7 ethnographic techniques
3) User-centred design and Design Thinking
4) Examples of UX-led changes made at institutions in the UK and Scandinavia
5) Next Steps - if you'd like to try out UX at your own organisation
The technologies and people we are designing experiences for are constantly changing; in most cases, they are changing at a rate that is difficult to keep up with. When we think about how our teams are structured and the design processes we use in light of this challenge, a new design problem (or problem space) emerges, one that requires us to focus inward. How do we structure our teams and processes to be resilient? What would happen if we looked at our teams and design process as IAs, designers, and researchers? What strategies would we put in place to help them be successful? This talk will look at challenges we face leading, supporting, or simply being a part of design teams creating experiences for user groups with changing technological needs.
Pivotal cloud cache for .net microservices - Jagdish Mirani
In-memory caching is not new technology, but it takes on renewed significance with cloud-native, distributed application architectures. Modern day caching can alleviate the performance and availability challenges associated with cloud-native, distributed architectures.
This presentation explores the unique characteristics of modern, distributed application architectures that make caching a vital part of the solution.
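As a small, generic illustration of why such a cache helps (this is not Pivotal Cloud Cache's API, which is GemFire-based and exposed to .NET clients), the Python sketch below shows the cache-aside pattern: read from the cache first and fall back to the slower system of record only on a miss.

```python
# Generic cache-aside sketch (illustrative only; not the Pivotal Cloud Cache API):
# check an in-memory cache first, and only hit the slow system of record on a miss.
import time

cache = {}                      # stand-in for an in-memory cache cluster
CACHE_TTL_SECONDS = 30


def load_from_database(user_id):
    time.sleep(0.2)             # simulate a slow backing store
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(user_id):
    entry = cache.get(user_id)
    if entry and time.time() - entry["cached_at"] < CACHE_TTL_SECONDS:
        return entry["value"]                   # cache hit: no database round trip
    value = load_from_database(user_id)         # cache miss: read through and populate
    cache[user_id] = {"value": value, "cached_at": time.time()}
    return value


get_user(42)   # slow first call populates the cache
get_user(42)   # subsequent calls within the TTL are served from memory
```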
With SPS 11 for the SAP HANA platform, some major additions to SAP HANA extended application services are planned. On the JavaScript side, we plan to add Google V8 and full support for Node.js. We also plan to add a standard Java runtime (TomEE). The deployment infrastructure is planned to replace the current repository for SAP HANA. Come and see the features of the deployment infrastructure and the new XS Advanced runtimes, how design-time objects will now be managed in Git, and how to utilize the new container concept.
Boost Performance with Scala – Learn From Those Who’ve Done It! - Cécile Poyet
Scalding is a Scala DSL for Cascading. Run on Hadoop, it’s a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
Boost Performance with Scala – Learn From Those Who’ve Done It! - Hortonworks
Scalding is a Scala DSL for Cascading. Run on Hadoop, it’s a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
Data Integration with Apache Kafka: What, Why, How - Pat Patterson
Presented at the Orange County Advanced Analytics and Big Data Meetup, June 21, 2019.
Apache Kafka has fast become the dominant messaging technology for the enterprise; if you're a data scientist or data engineer and you have not yet worked with Kafka, that situation will likely change soon! In this session, Pat Patterson, director of evangelism at StreamSets, explains what Kafka is, why it has disrupted the previous generation of messaging products, and how you can use open source products to build dataflow pipelines with Kafka, without writing code.
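The session itself focuses on building pipelines without code, but for readers who have not yet touched Kafka, a minimal kafka-python sketch shows the publish/subscribe model the talk builds on; the broker address and topic name are assumptions.

```python
# Minimal kafka-python sketch of the publish/subscribe model the talk builds on.
# Broker address and topic name are assumptions; requires the kafka-python package.
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "alice", "action": "download"})
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,                 # stop iterating when no new messages arrive
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```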
Learn about the challenges that come with deploying and operating Kubernetes at scale and how the Mesosphere DC/OS Kubernetes integration helps solve them.
During this presentation, Joerg Schad discusses:
1. Common challenges associated with getting a Kubernetes cluster up and running
2. The basics of running Kubernetes on Mesosphere DC/OS
3. How failure recovery works with the DC/OS-Kubernetes solution
Azure + DataStax Enterprise Powers Office 365 Per User Store - DataStax Academy
We will present our O365 use case scenarios, why we chose Cassandra + Spark, and walk through the architecture we chose for running DataStax Enterprise on Azure.
Cloud Foundry Diego, Lattice, Docker and more - Cornelia Davis
Colorado Cloud Foundry Meetup
May 19, 2015
Lattice and Docker with Cornelia Davis
Starting with a comparison of the current core runtime of the Cloud Foundry Elastic Runtime to the new Diego rewrite, we take a tour through how Linux containers can run a variety of image formats, including Docker. We talk about one way that you can get the Diego functionality in Lattice, a container scheduler that runs on a laptop or as a cluster in the cloud. We talk about ways of creating container images, including Cloud Rocker, and we draw it all together with a bunch of demos.
Abstract from the meetup:
What is Lattice (www.lattice.cf)?
Lattice is an open source project for running containerized workloads on a cluster. A Lattice cluster is comprised of a number of Lattice Cells (VMs that run containers) and a Lattice Coordinator that monitors the Cells.
Lattice includes built-in http load-balancing, a cluster scheduler, log aggregation with log streaming and health management.
Lattice containers are described as long-running processes or temporary tasks. Lattice includes support for Linux Containers expressed either as Docker Images or by composing applications as binary code on top of a root file system. Lattice's container pluggability will enable other backends such as Windows or Rocket in the future.
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store - DataStax Academy
We will present our Office 365 use case scenarios, why we chose Cassandra + Spark, and walk through the architecture we chose for running DSE on Azure.
The presentation will feature demos on how you too can build similar applications.
GSJUG: Mastering Data Streaming Pipelines 09May2023 - Timothy Spann
GSJUG: Mastering Data Streaming Pipelines 09May2023
https://www.meetup.com/futureofdata-princeton/events/293233881/
This is a repost from the Garden State Java Users Group Event.
Join me at
https://www.meetup.com/garden-state-java-user-group/events/293229660/
See: https://www.eventbrite.com/e/mastering-data-streaming-pipelines-tickets-627677218457?_ga=2.253257801.1787151623.1682868226-741104479.1678110925
Please note that registration via EventBrite is required to attend either in-person or online.
We are happy to announce that Tim Spann will be our special guest for the May 9, 2023 meeting!
Abstract:
In this session, Tim will show you some best practices that he has discovered over the last seven years in building data streaming applications including IoT, CDC, Logs, and more.
In his modern approach, we utilize several Apache frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink, enhance events with NiFi enrichment. We build continuous queries against our topics with Flink SQL.
We will show where Java fits in as sources, enrichments, NiFi processors and sinks.
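The talk's own examples are not reproduced here, but a short PyFlink sketch can illustrate what a continuous Flink SQL query over a Kafka topic looks like; the topic, field names, and broker address are assumptions, and the Flink Kafka SQL connector jar must be available for it to run.

```python
# Hedged PyFlink sketch of a continuous Flink SQL query over a Kafka topic.
# Topic, fields, and broker address are assumptions; the flink-sql-connector-kafka
# jar must be on the classpath for this to run.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE sensor_events (
        device_id STRING,
        temperature DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sensor-events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE hot_devices (device_id STRING, max_temp DOUBLE) WITH ('connector' = 'print')
""")

# Continuous query: keeps running and updating results as new events arrive.
t_env.execute_sql("""
    INSERT INTO hot_devices
    SELECT device_id, MAX(temperature) AS max_temp
    FROM sensor_events
    GROUP BY device_id
""").wait()
```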
We hope to see you on May 9!
Speaker
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
This is the talk I gave at the Seattle Spark Meetup in March, 2015. I discussed some Spark Streaming fundamentals, integration points with Kafka, Flume etc.
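The original meetup talk used the 2015-era DStream API; as a hedged, modern illustration of the same Kafka integration point, here is a PySpark Structured Streaming sketch. The topic and broker address are assumptions, and the spark-sql-kafka package must be on the classpath.

```python
# Hedged PySpark Structured Streaming sketch of the Kafka integration point
# discussed in the talk (the talk itself covered the older DStream API).
# Topic and broker address are assumptions; requires the spark-sql-kafka package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS value", "timestamp")
)

# Count messages per 1-minute window and print the running result to the console.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```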
With the advent of new open source platforms around Hadoop, NoSQL databases & in-memory databases, the data management stack in the enterprise is undergoing complete re-platforming. Batch and stream processing are two distinct data processing paradigms that need to be supported over this new stack. In this session I will talk about the importance of having a unified batch and stream processing engine and share my learning around -
Sample use cases that bring out the need to have a unified stream & batch processing engine
Important features needed in the unified platform to tackle the above use cases.
Cloudera Operational DB (Apache HBase & Apache Phoenix) - Timothy Spann
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Using Apache NiFi 1.10 to read/write from HBase
Dec 2019, Timothy Spann, Field Engineer, Data in Motion
Princeton Meetup 10-dec-2019
https://www.meetup.com/futureofdata-princeton/events/266496424/
Hosted By PGA Fund at:
https://pga.fund/coworking-space/
Princeton Growth Accelerator
5 Independence Way, 4th Floor, Princeton, NJ
Similar to Logging infrastructure for Microservices using StreamSets Data Collector
Introducing a horizontally scalable, inference-based business Rules Engine fo... - Cask Data
Speaker: Nitin Motgi, Cask
Big Data Applications Meetup, 09/20/2017
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to video: https://www.youtube.com/watch?v=FnQwDaKii2U
About the talk:
Business Rules are statements that describe business policies or procedures to process data. Rules engines or inference engines execute business rules in a runtime production environment, and have become commonplace for many IT applications. In the world of big data, however, there has been a gap: no horizontally scalable, lightweight, inference-based business rules engine for big data processing.
In this session, you learn about Cask’s new business Rules Engine built on top of CDAP, which is a sophisticated if-then-else statement interpreter that runs natively on big data systems such as Spark, Hadoop, Amazon EMR, Azure HDInsight and GCE. It provides an alternative computational model for transforming your data while empowering business users to specify and manage the transformations and policy enforcements.
In his talk, Nitin Motgi, Cask co-founder and CTO, demonstrates this new, distributed rules engine and explains how business users in big data environments can make decisions on their data, enforce policies, and be an integral part of the data ingestion and ETL process. He also shows how business users can write, manage, deploy, execute and monitor business data transformation and policy enforcements.
Check out http://bdam.io/ for more info on the Big Data Apps meetup!
Transactions in HBase, by Andreas Neumann, Cask - Cask Data
Title: Transactions in HBase
Speaker: Andreas Neumann, Cask
ApacheCon Big Data, Miami, FL
May 18, 2017
Abstract:
In the age of NoSQL, big data storage engines such as HBase have given up ACID semantics of traditional relational databases, in exchange for high scalability and availability. However, it turns out that in practice, many applications require consistency guarantees to protect data from concurrent modification in a massively parallel environment. In the past few years, several transaction engines have been proposed as add-ons to HBase: Three different engines, namely Omid, Tephra, and Trafodion were open-sourced within the Apache ecosystem alone. In this talk, Andreas Neumann will introduce and compare the different approaches from various perspectives including scalability, efficiency, operability and portability, and make recommendations pertaining to different use cases.
Speaker Bio:
Andreas Neumann develops big data software at Cask, and has formerly done so at places that are known for massive scale. He was the chief architect for Hadoop at Yahoo! and also for the foundational content management system that Yahoo! built on Hadoop. Previously he was a research engineer at Yahoo! and a search architect at IBM. Andreas holds a doctoral degree in computer science for his work on querying XML documents.
#BDAM: EDW Optimization with Hadoop and CDAP, by Sagar Kapare from Cask - Cask Data
Speaker: Sagar Kapare, Cask
Big Data Applications Meetup, 05/10/2017
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to video: https://youtu.be/mSKwjKvYUtI
About the talk:
The cost of maintaining a traditional Enterprise Data Warehouse (EDW) is skyrocketing as legacy systems buckle under the weight of exponentially growing data and increasingly complex processing needs. Hadoop, with its massive horizontal scalability, and CDAP, which offers pre-built pipelines for EDW offload in a drag-and-drop studio environment, can help.
Sagar will demonstrate Cask’s solution, which shows how to build code-free, scalable, and enterprise-grade pipelines for delivering an easy-to-use and efficient EDW offload solution. He will also show how interactive data preparation, data pipeline automation, and fast querying capabilities over voluminous data can help unlock new use-cases.
"Who Moved my Data? - Why tracking changes and sources of data is critical to...Cask Data
Speaker: Russ Savage, from Cask
Big Data Applications Meetup, 09/14/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://youtu.be/4j78g3WvC4Y
About the talk:
As data lake sizes grow, and more users begin exploring and including that data in their everyday analysis, keeping track of the sources for data becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis is leveraging the most accurate and up to date information. In this talk, we will explore the different techniques available to keep track of your data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.
Building Enterprise Grade Applications in YARN with Apache Twill - Cask Data
Speaker: Poorna Chandra, from Cask
Big Data Applications Meetup, 07/27/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://www.youtube.com/watch?v=I1GLRXyQlx8
About the talk:
Twill is an Apache incubator project that provides a higher-level abstraction for building distributed systems applications on YARN. Developing distributed applications directly on YARN is challenging because it does not provide higher-level APIs, and lots of boilerplate code needs to be duplicated to deploy applications. Developing YARN applications is typically done by framework developers, like those familiar with Apache Flink or Apache Spark, who need to deploy the framework in a distributed way.
By using Twill, application developers need only be familiar with the basics of the Java programming model when using the Twill APIs, so they can focus on solving business problems. In this talk, I present how Twill can be leveraged, using the Cask Data Application Platform (CDAP), which relies heavily on Twill for resource management, as an example.
Cask Webinar
Date: 08/10/2016
Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0
In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit.
Some of the highlights include:
- Enterprise-grade security - Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger.
- Preview mode - Ability to preview and debug data pipelines before deploying them.
- Joins in Cask Hydrator - Capabilities to join multiple data sources in data pipelines
- Real-time pipelines with Spark Streaming - Drag & drop real-time pipelines using Spark Streaming.
- Data usage analytics - Ability to report application usage of data sets.
- And much more!
Transactions for Apache HBase™: Apache Tephra provides globally consistent transactions on top of Apache HBase. While HBase provides strong consistency with row- or region-level ACID operations, it sacrifices cross-region and cross-table consistency in favor of scalability. This trade-off requires application developers to handle the complexity of ensuring consistency when their modifications span region boundaries. By providing support for global transactions that span regions, tables, or multiple RPCs, Tephra simplifies application development on top of HBase, without a significant impact on performance or scalability for many workloads.
ACID Transactions in Apache Phoenix with Apache Tephra™ (incubating), by Poor...Cask Data
TITLE: ACID Transactions in Apache Phoenix with Apache Tephra™ (incubating)
SPEAKER: Poorna Chandra, Cask Data
DATE: May 25, 2016
LOCATION: PhoenixCon, San Francisco CA
http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182/
TALK ABSTRACT:
This talk covers how Apache Phoenix added support for ACID transactions using Apache Tephra™ (incubating), an open source transaction engine on top of Apache HBase. It starts by examining why Phoenix data operations need to be transactional, then discusses how Tephra implements transactional semantics using optimistic concurrency control, with an overview of Tephra's transaction model and high-level architecture. The talk then describes the details of the Phoenix integration with Tephra and presents some performance benchmark results for Phoenix operations with transactions, concluding with a discussion of some challenges in scaling Tephra and potential solutions.
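As context for the integration described above, here is a hedged sketch of how a client might use Phoenix transactional tables over JDBC; the connection string and table are placeholders, and the cluster must have Phoenix transactions enabled (phoenix.transactions.enabled=true) for the TRANSACTIONAL table property to work.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PhoenixTransactionExample {
      public static void main(String[] args) throws Exception {
        // "zkhost" is a placeholder for the HBase/ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost")) {
          conn.setAutoCommit(false);

          try (Statement stmt = conn.createStatement()) {
            // Declaring the table TRANSACTIONAL routes its writes through Tephra.
            stmt.execute("CREATE TABLE IF NOT EXISTS ITEMS ("
                + "ID BIGINT NOT NULL PRIMARY KEY, NAME VARCHAR) TRANSACTIONAL=true");

            // Both upserts become visible atomically at commit, or not at all.
            stmt.executeUpdate("UPSERT INTO ITEMS VALUES (1, 'camera')");
            stmt.executeUpdate("UPSERT INTO ITEMS VALUES (2, 'drone')");
          }
          conn.commit();  // conflicts are detected at commit time (optimistic concurrency control)
        }
      }
    }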
Introducing Athena: 08/19 Big Data Application Meetup, Talk #3 Cask Data
Speaker: Yuanchi Ning from Uber
Big Data Applications Meetup, 08/19/2015
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://www.youtube.com/watch?v=SY1YSU8cFLI
About the talk:
Athena is a stream processing platform for Uber's near-real-time analytics applications, built using Samza. We will discuss some of the existing and upcoming use cases and how they impact Uber partners and riders. The talk will go through the tooling built around Samza for easier user onboarding, such as a deployment manager, integration with the Typesafe config system, a unit test framework, Graphite integration, metric whitelisting, and so on. We'll also go over some of the issues observed along the way.
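For readers new to Samza, the sketch below shows the basic shape of a task in Samza's low-level API, the kind of building block a platform like Athena wraps with its tooling; the stream names and logic are hypothetical and not taken from the talk.

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class TripEventTask implements StreamTask {
      // Hypothetical Kafka output topic; not from the talk.
      private static final SystemStream OUTPUT = new SystemStream("kafka", "trip-metrics");

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        Object event = envelope.getMessage();
        // ... derive a metric or enriched record from the incoming event ...
        collector.send(new OutgoingMessageEnvelope(OUTPUT, event));
      }
    }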
NRT Event Processing with Guaranteed Delivery of HTTP Callbacks, HBaseCon 2015Cask Data
HBaseCon 2015
May 7
San Francisco
This talk at HBaseCon was given by Poorna Chandra from Cask and Alan Steckley from Salesforce.com
Here's a short summary of the talk:
Salesforce is building a new service, code-named Webhooks, that enables customers' own systems to respond in near real time to system events and customer behavioral actions from the Salesforce Marketing Cloud. The application must process millions of events per day to meet current needs and scale up to billions of events per day in the future, so horizontal scalability is a primary concern. In this talk, they discussed how Webhooks is built using HBase for data storage and the Cask Data Application Platform (CDAP), an open source framework for building applications on Hadoop.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and the scientific community's broad response to it have forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team's work applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Strategies for Successful Data Migration Tools.pptxvarshanayak241
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data migration tools like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
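One widely used technique for the atomicity problem mentioned above is the transactional outbox pattern, in which the state change and the outgoing domain event are written in the same database transaction and a separate relay later publishes the event to the message bus. The sketch below illustrates that general pattern with plain JDBC and hypothetical table names; it is not Wix's actual implementation.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class OrderService {
      // Writes the state change and the domain event atomically; a separate relay
      // process polls the outbox table and publishes rows to the message bus.
      public void createOrder(Connection conn, long orderId, String payload) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement insertOrder = conn.prepareStatement(
                 "INSERT INTO orders (id, payload) VALUES (?, ?)");
             PreparedStatement insertEvent = conn.prepareStatement(
                 "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)")) {
          insertOrder.setLong(1, orderId);
          insertOrder.setString(2, payload);
          insertOrder.executeUpdate();

          insertEvent.setLong(1, orderId);
          insertEvent.setString(2, "OrderCreated");
          insertEvent.setString(3, payload);
          insertEvent.executeUpdate();

          conn.commit();   // the update and the event become visible together, or not at all
        } catch (Exception e) {
          conn.rollback();
          throw e;
        }
      }
    }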
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Logging infrastructure for Microservices using StreamSets Data Collector
1. Logging Infrastructure for Microservices using StreamSets Data Collector
Presenter: Virag Kothari, Software Engineer at StreamSets