Real Time API delivering data @ Scale



This post discusses the architectural decisions, and the reasons behind them, that went into building a REST API that must deliver large amounts of reporting data.


Akash Mishra
Real Time API
delivering Data @ Scale

Agenda
API Overview
Key System Requirements
Big Data System Vs RDBMS
Architecture
Data Flow
Questions?

API Overview
API details
REST based API
Partners can request various types of reports
Each report holds data on the order of terabytes (TB)
Sample Request
?start-date=2012-10-01&end-date=2012-10-29&partner=1&aggregate-by=state,city
Response
Zip file [size on the order of 10-30 MB]
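
The sample request above could be parsed on the server side roughly as follows. This is a minimal sketch, not the deck's actual implementation: the class name `ReportQuery` and its validation rules are assumptions, and only the four parameter names come from the slide. Values are assumed to arrive without URL encoding.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value class; field names mirror the parameters in the sample request.
final class ReportQuery {
    final LocalDate startDate;
    final LocalDate endDate;
    final int partner;
    final List<String> aggregateBy;

    private ReportQuery(LocalDate startDate, LocalDate endDate, int partner, List<String> aggregateBy) {
        this.startDate = startDate;
        this.endDate = endDate;
        this.partner = partner;
        this.aggregateBy = aggregateBy;
    }

    // Parses a raw query string such as "?start-date=2012-10-01&end-date=2012-10-29&partner=1".
    // Assumption: values are not URL-encoded.
    static ReportQuery parse(String query) {
        String q = query.startsWith("?") ? query.substring(1) : query;
        Map<String, String> params = new HashMap<>();
        for (String pair : q.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Malformed parameter: " + pair);
            }
            params.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        LocalDate start = LocalDate.parse(required(params, "start-date"));
        LocalDate end = LocalDate.parse(required(params, "end-date"));
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("end-date is before start-date");
        }
        int partner = Integer.parseInt(required(params, "partner"));
        // aggregate-by is treated as optional here (an assumption).
        List<String> aggregateBy = new ArrayList<>();
        String agg = params.get("aggregate-by");
        if (agg != null && !agg.isEmpty()) {
            aggregateBy.addAll(Arrays.asList(agg.split(",")));
        }
        return new ReportQuery(start, end, partner, aggregateBy);
    }

    private static String required(Map<String, String> params, String name) {
        String value = params.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException("Missing parameter: " + name);
        }
        return value;
    }
}
```

Validating the date range and partner id up front keeps malformed requests from reaching the database at all.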

Key System Requirements
Interactive Filtering Query
– Partners can filter data on various parameters.
Real Time Response
– SLA of 1-3 min.
Security
Extremely private and confidential data.
Must pass an audit by an external vendor.
Scalability
More customers should require only more machines.

Big Data System Vs Relational Data System
Large amount of data [on the order of TB]
Hadoop/Hive
RDBMS
Real Time Interactive Filtering/Querying
Hadoop/Hive
RDBMS
Joins between large tables [millions × millions × millions]
– Hadoop/Hive
– RDBMS

Big Data System Vs Relational Data System
Access/Security Control
Hadoop/Hive
RDBMS
Resilient to Hardware failure and Auto Scaling
Hadoop/Hive
RDBMS
Fast read operations
– Hadoop/Hive
– RDBMS

Data Flow
De-normalization on Hadoop/Hive
Time: 3hrs
#Records: 230m

Data Flow
Dynamic partitioning on Hadoop/Hive
# Buckets 15
#Records: 230m
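
The slide does not say which column the 15 buckets are keyed on. As an illustration only, the sketch below assigns an integer key (assumed here to be something like a partner id) to one of 15 buckets using a non-negative hash modulo the bucket count, which is the general shape of hash bucketing. `BucketAssigner` is a hypothetical name.

```java
// Hypothetical helper illustrating hash bucketing into a fixed number of buckets.
final class BucketAssigner {
    static final int NUM_BUCKETS = 15; // bucket count taken from the slide

    // Masks off the sign bit so the result is always in [0, NUM_BUCKETS).
    static int bucketFor(int key) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % NUM_BUCKETS;
    }
}
```

With a fixed bucket count, all rows for a given key land in the same bucket, so a query filtered on that key only has to read one bucket's worth of data.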

Data Flow
Sqoop Export
#Records: 230m
Size: 1 TB
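
The figures on the Data Flow slides imply a rough throughput and record size. The sketch below just does that arithmetic; it assumes 1 TB means 10^12 bytes and that the 3 hours cover all 230 million records, neither of which the slides state explicitly. `PipelineFigures` is a hypothetical name.

```java
// Back-of-the-envelope figures derived from the slide numbers.
final class PipelineFigures {
    static final long RECORDS = 230_000_000L;      // from the slides
    static final double DENORMALIZATION_HOURS = 3; // from the slides
    static final double EXPORT_BYTES = 1e12;       // assumption: 1 TB = 10^12 bytes

    // Average de-normalization throughput in records per second.
    static double recordsPerSecond() {
        return RECORDS / (DENORMALIZATION_HOURS * 3600.0);
    }

    // Average size of one exported record in bytes.
    static double averageBytesPerRecord() {
        return EXPORT_BYTES / RECORDS;
    }

    public static void main(String[] args) {
        System.out.printf("%.0f records/s, %.0f bytes/record%n",
                recordsPerSecond(), averageBytesPerRecord());
    }
}
```

Under those assumptions the de-normalization step averages roughly 21,000 records per second, and each exported record averages a little over 4 KB.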

Data Flow
Security Control in RDBMS
Strong User authentication mechanism.
Restricted access to each user on database and table level
Each partner has specific user and associated tables
No cross-referencing of data across partner tables.
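
The per-partner isolation described above can also be mirrored as a guard in the API layer, in addition to the database grants. The sketch below is an assumption-laden illustration: the `report_partner_<id>` and `partner_<id>_ro` naming schemes and the class `PartnerScope` are hypothetical and do not come from the deck.

```java
// Hypothetical guard that ties each partner to its own database user and table.
final class PartnerScope {
    // Hypothetical naming scheme: one table per partner.
    static String tableFor(int partnerId) {
        requirePositive(partnerId);
        return "report_partner_" + partnerId;
    }

    // Hypothetical naming scheme: one read-only database user per partner.
    static String userFor(int partnerId) {
        requirePositive(partnerId);
        return "partner_" + partnerId + "_ro";
    }

    // Rejects any attempt to touch a table that belongs to another partner.
    static void requireOwnTable(int partnerId, String table) {
        if (!tableFor(partnerId).equals(table)) {
            throw new SecurityException(
                    "Partner " + partnerId + " may not access table " + table);
        }
    }

    private static void requirePositive(int partnerId) {
        if (partnerId <= 0) {
            throw new IllegalArgumentException("Invalid partner id: " + partnerId);
        }
    }
}
```

Deriving the table name from the authenticated partner id, rather than accepting it from the request, means a request can never name another partner's table in the first place.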

Data Flow
Java API
Common Pattern [Streaming]
• Read a bunch of records from DB.
• Process records.
• Stream back to client.
Avoid creating unnecessary objects
• Java heap out-of-memory errors were caused by using String in place of char arrays.
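
The streaming pattern above might look roughly like this. It is a sketch under stated assumptions, not the deck's code: `ReportStreamer`, the `report.csv` entry name, and the batch size are hypothetical, and an `Iterator<String[]>` stands in for a database cursor so the example stays self-contained. The row buffer and `char[]` are reused across rows to avoid allocating a new String per output line.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical streamer: read rows, format them, and push them to the client as a zip.
final class ReportStreamer {
    static final int BATCH_SIZE = 1000; // hypothetical flush interval, in rows

    // `rows` stands in for a database cursor (assumption); `client` is the response stream.
    // Returns the number of rows written.
    static long streamAsZip(Iterator<String[]> rows, OutputStream client) throws IOException {
        long written = 0;
        StringBuilder line = new StringBuilder(256); // reused for every row
        char[] buffer = new char[256];               // reused, grown only when needed
        try (ZipOutputStream zip = new ZipOutputStream(client);
             Writer out = new OutputStreamWriter(zip, StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry("report.csv"));
            while (rows.hasNext()) {
                String[] row = rows.next();
                line.setLength(0);
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) {
                        line.append(',');
                    }
                    line.append(row[i]);
                }
                line.append('\n');
                // Copy into the reusable char[] instead of building a String per line.
                int len = line.length();
                if (len > buffer.length) {
                    buffer = new char[len];
                }
                line.getChars(0, len, buffer, 0);
                out.write(buffer, 0, len);
                if (++written % BATCH_SIZE == 0) {
                    out.flush(); // hand the finished batch to the zip stream
                }
            }
            out.flush();
            zip.closeEntry();
        }
        return written;
    }
}
```

Because rows are written as they are read, memory use stays bounded by the buffer sizes rather than by the size of the report.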
