A brief primer by Nick Elser on how Instacart uses Elasticsearch, and how Instacart leverages Druid and Kinesis to instrument Elasticsearch at scale on the client side, debugging multi-tenant performance issues and problematic queries to keep one of its most important data stores humming.
NoSQL Riak MongoDB Elasticsearch - All The Same? Eberhard Wolff
Gives a general introduction to NoSQL and modeling data with JSON, then compares MongoDB, Riak and Elasticsearch - systems that seem the same at first sight but are in fact quite different. Presented at JavaLand.
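The JSON data-modeling point above can be made concrete: the same order can be represented as one embedded document (the document-store style the talk compares) or as normalized rows linked by keys. This is a minimal sketch in plain Python/JSON; the field names are invented for illustration and this is not any particular database's API.

```python
import json

# Document style: the order embeds its line items, so one read fetches everything.
order_doc = {
    "order_id": 1,
    "customer": {"name": "Alice", "city": "Berlin"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

# Relational/normalized style: the same data split into rows joined by keys.
orders = [{"order_id": 1, "customer_id": 10}]
customers = [{"customer_id": 10, "name": "Alice", "city": "Berlin"}]
order_items = [
    {"order_id": 1, "sku": "A-1", "qty": 2},
    {"order_id": 1, "sku": "B-7", "qty": 1},
]

print(json.dumps(order_doc, indent=2))  # the document a store like MongoDB would hold
print(len(order_items))                 # -> 2
```

The trade-off the talk hinges on is visible even here: the embedded form is cheap to read whole but awkward to query across orders, while the normalized form needs joins but avoids duplication.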
Taking a look under the hood of Apache Flink's relational APIs. Fabian Hueske
Apache Flink features two APIs which are based on relational algebra, a SQL interface and the so-called Table API, which is a LINQ-style API available for Scala and Java. Relational APIs are interesting because they are easy to use and queries can be automatically optimized and translated into efficient runtime code. Flink offers both APIs for streaming and batch data sources. This talk takes a look under the hood of Flink’s relational APIs. The presentation shows the unified architecture to handle streaming and batch queries and explains how Flink translates queries of both APIs into the same representation, leverages Apache Calcite to optimize them, and generates runtime code for efficient execution. Finally, the slides discuss potential improvements and give an outlook for future extensions and features.
Pipeline designer allows users to author their processes and provision them on Falcon. This should make building applications on Falcon over Hadoop fairly trivial. Falcon can operate with HCatalog tables natively, meaning there is a one-to-one correspondence between a Falcon feed and an HCatalog table. Between the feed definition in Falcon and the underlying table definition in HCatalog, there is adequate metadata about the data stored underneath. These datasets can then be operated on by a collection of transformations to extract more refined datasets/feeds. This logic (currently represented via Oozie workflows, Pig scripts, or MapReduce jobs) is typically represented through the Falcon process. In this talk we walk through the details of the pipeline designer and the current state of this feature.
A presentation delivered to the Auckland Atlassian User Group on 22 February 2012, exploring the Atlassian developer ecosystem at a high level: why you would want to develop for it, how to develop for it, and so on.
This covered a number of things that are landing in Jira 5 and above only, e.g. activity streams and remote links.
http://www.meetup.com/Akl-AUG/events/47434772/
Designing a social network offers some exciting challenges to engineers. The system needs to operate at scale, to provide a responsive user experience and to be able to inspect user activity in order to both generate new content and improve how the existing content is delivered.
Event-driven architectures are particularly suitable for handling these kinds of challenges, and highly scalable messaging systems such as Apache Kafka have been designed specifically to support the requirements of modern high-volume applications.
In this talk we describe how the Crowdmix back-end has been designed as an event-based system running on top of Kafka. We present the overall system architecture and discuss in more detail some of the subcomponents processing those events in different fashions, from streaming-based processing to batch processing, passing through a lambda-style batch and stream cooperation.
We conclude by describing some lessons learned from our one-year journey in implementing and operating the system.
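The Kafka-style abstraction this architecture rests on can be sketched in a few lines: producers append events to a partitioned log, and each consumer tracks its own read offsets, so a streaming job and a batch job can read the same events at different paces. All names here are illustrative toys, not the actual Kafka or Crowdmix APIs.

```python
from collections import defaultdict

class ToyLog:
    """In-memory stand-in for a partitioned, append-only event log."""
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, event):
        # Same key always lands in the same partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)
        return p

class ToyConsumer:
    """Each consumer owns its offsets, so consumers are independent."""
    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)  # per-partition read position

    def poll(self):
        # Drain any events appended since this consumer's last poll.
        out = []
        for p, events in enumerate(self.log.partitions):
            out.extend(events[self.offsets[p]:])
            self.offsets[p] = len(events)
        return out

log = ToyLog()
fast = ToyConsumer(log)   # e.g. a streaming job
slow = ToyConsumer(log)   # e.g. a nightly batch job

log.append("user-1", {"type": "follow", "user": "user-1"})
print(fast.poll())        # streaming consumer sees the first event immediately
log.append("user-1", {"type": "post", "user": "user-1"})
print(slow.poll())        # batch consumer later reads both, from its own offsets
```

The key property mirrored here is that reading is non-destructive: the log retains events, and each consumer's progress is just an offset it owns.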
Functional languages, and Scala in particular, have been attracting the interest of a growing number of developers in recent years. Scala has built a core of highly motivated supporters in the so-called reactive systems area and in the Big Data domain.
In this talk we describe how, at Crowdmix, we evolved from an initial implementation of the system based on Java 8 to one mostly implemented in Scala. The talk also describes how we evolved from a traditional service design to a highly scalable reactive system.
Although this could possibly be considered a less common and simpler case than the majority of monolith legacy Java system conversions, there is a set of lessons learned that we would like to share.
In particular we will describe:
- How we approached the transition, what worked well, what didn’t
- Handling the coexistence of the two languages in the same service and across services
- What we gained with the transition, and what we lost
Alfresco Tech Talk Live - REST API of the Future Gavin Cornwell
The Alfresco Repository comes with a great REST API, but some of these APIs can be difficult to navigate and use. Gavin Cornwell, Engineering Manager, Content Services Practice at Alfresco, will tell us about the future of REST APIs, where the market is moving, and how we are embracing this future at Alfresco.
Distributed Microservices Platform, or How Olist Works - Osvaldo Santana Neto
In this presentation I show in detail how the Olist architecture was implemented using distributed microservices that communicate through a messaging system based on the Publisher/Subscriber (PubSub) pattern.
The Olist platform uses Python 3.6, Django, Django REST Framework, Loafer (asyncio), AWS SNS/SQS, and PostgreSQL. We deploy on Heroku and develop all of this with a team distributed across almost every region of Brazil (we are still missing someone from the North!).
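The publish/subscribe pattern described above can be reduced to a small sketch: a topic fans each message out to every subscribed queue, and each microservice consumes its own queue independently (the SNS-to-SQS topology, shrunk to in-memory objects). Names are illustrative; this is not Loafer's or AWS's actual API.

```python
from collections import deque

class Topic:
    """SNS-style topic: fan-out to every subscribed queue."""
    def __init__(self):
        self.queues = []

    def subscribe(self):
        q = deque()               # SQS-style queue owned by one service
        self.queues.append(q)
        return q

    def publish(self, message):
        for q in self.queues:     # every subscriber gets its own copy
            q.append(message)

orders = Topic()
billing_q = orders.subscribe()    # billing service's queue
shipping_q = orders.subscribe()   # shipping service's queue

orders.publish({"event": "order.created", "order_id": 42})

print(billing_q.popleft())        # billing sees the event...
print(shipping_q.popleft())       # ...and shipping independently sees the same one
```

Because each service owns a queue rather than sharing one, a slow or failed consumer never makes another service miss messages, which is what makes this topology attractive for a distributed team shipping services independently.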
Alfresco 5.2 Introduces New Public REST APIs
For an update, please see: https://www.slideshare.net/jvonka/exciting-new-alfresco-apis
https://www.meetup.com/Alfresco-Meetups/events/236987848/
An overview of the new and enhanced APIs will be discussed, and some of the key endpoints will be demonstrated via Postman, so that by the time you leave you should have enough knowledge to create a simple client or integration.
These APIs will also be the foundation for new clients developed for the Alfresco Digital Business Platform.
We'll have a sneak peek at what's coming next and leave plenty of time for questions, feedback and open discussion.
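As a taste of what such a simple client might look like, here is a minimal sketch that builds an authenticated request in the style of the v1 REST API layout introduced in Alfresco 5.2. The base URL and credentials are placeholders, and the endpoint path is stated as my understanding of the public API layout rather than a guarantee; verify it against your server's API explorer.

```python
import base64
import urllib.request

# Placeholder host; the path segment follows the Alfresco 5.2 v1 public API layout.
BASE = "https://alfresco.example.com/alfresco/api/-default-/public/alfresco/versions/1"

def get_node_request(node_id, user, password):
    """Build (but do not send) an authenticated GET for a node's metadata."""
    url = f"{BASE}/nodes/{node_id}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

req = get_node_request("-root-", "admin", "admin")
print(req.full_url)
# Sending it would be: urllib.request.urlopen(req)  (requires a live server)
```

Only standard-library pieces are used, so the same shape works from any script or integration without extra dependencies.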
This presentation, held at Inovex GmbH in Munich in November 2015, gave a general introduction to the streaming space, an overview of Flink, and use cases of production users as presented at Flink Forward.
January 2016 Flink Community Update & Roadmap 2016 Robert Metzger
This presentation from the 13th Flink Meetup in Berlin contains the regular community update for January and a walkthrough of the most important upcoming features in 2016
Flink Streaming is the real-time data processing framework of Apache Flink. Flink Streaming provides high-level functional APIs in Scala and Java backed by a high-performance true-streaming runtime.
Step-by-Step Introduction to Apache Flink Slim Baltagi
This is a talk that I gave at the 2nd Apache Flink meetup in the Washington DC area, hosted and sponsored by Capital One on November 19, 2015. You will quickly learn, in a step-by-step way:
1. How to set up and configure your Apache Flink environment
2. How to use Apache Flink tools
3. How to run the examples in the Apache Flink bundle
4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink
5. How to write your Apache Flink program in an IDE
http://www.learntek.org/product/apache-flink/
Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Apache Flink’s dataflow programming model provides event-at-a-time processing on both finite and infinite datasets. At a basic level, Flink programs consist of streams and transformations. Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result. Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.
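The "streams and transformations" model described above can be sketched with Python generators: a stream is a (possibly unbounded) iterator of records, and each transformation consumes one stream and produces another. This mirrors the concept only; it is not Flink's actual DataStream API, and the function names are invented.

```python
def source(records):
    """A stream: just an iterator of records (could be endless)."""
    yield from records

def map_stream(stream, fn):
    """Transformation: one input stream in, one output stream out."""
    for record in stream:
        yield fn(record)

def filter_stream(stream, predicate):
    for record in stream:
        if predicate(record):
            yield record

# Build a small dataflow: source -> filter -> map.
pipeline = map_stream(
    filter_stream(source([1, 2, 3, 4, 5]), lambda x: x % 2 == 1),
    lambda x: x * 10,
)
print(list(pipeline))  # -> [10, 30, 50]
```

Nothing runs until the result is consumed, which loosely mirrors how a Flink program only describes a dataflow that the engine later compiles and executes.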
http://www.learntek.org
Learntek is a global online training provider on Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing and other IT and management courses. We are dedicated to designing, developing and implementing training programs for students, corporate employees and business professionals.
Unified Batch and Real-Time Stream Processing Using Apache Flink Slim Baltagi
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache Flink is positioned at the forefront of 2 major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, you will also find answers to the burning question: Why Apache Flink? You will learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
Overview of Apache Flink: Next-Gen Big Data Analytics Framework Slim Baltagi
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: Real-Time stream processing, machine learning at scale, graph analytics and batch processing.
In these slides, you will find answers to the following questions: What is the Apache Flink stack, and how does it fit into the Big Data ecosystem? How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where can you learn more about Apache Flink?
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs Flink Forward
http://flink-forward.org/kb_sessions/taking-a-look-under-the-hood-of-apache-flinks-relational-apis/
Apache Flink features two APIs which are based on relational algebra, a SQL interface and the so-called Table API, which is a LINQ-style API available for Scala and Java. Relational APIs are interesting because they are easy to use and queries can be automatically optimized and translated into efficient runtime code. Flink offers both APIs for streaming and batch data sources. This talk will take a look under the hood of Flink’s relational APIs. We will show the unified architecture to handle streaming and batch queries and explain how Flink translates queries of both APIs into the same representation, leverages Apache Calcite to optimize them, and generates runtime code for efficient execution. Finally, we will discuss potential improvements and give an outlook for future extensions and features.
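The core idea described above, both APIs producing the same logical representation that can then run over batch or streaming input, can be sketched as a toy plan interpreter: the operators only assume an iterable of rows, so one plan serves a bounded list and an unbounded generator alike. The plan nodes and names are invented for illustration; Flink's real planner goes through Apache Calcite and code generation.

```python
class Scan:
    def execute(self, rows):
        return iter(rows)

class Filter:
    def __init__(self, predicate, child):
        self.predicate, self.child = predicate, child
    def execute(self, rows):
        return (r for r in self.child.execute(rows) if self.predicate(r))

class Project:
    def __init__(self, columns, child):
        self.columns, self.child = columns, child
    def execute(self, rows):
        return ({c: r[c] for c in self.columns} for r in self.child.execute(rows))

# One logical plan, roughly: SELECT name FROM t WHERE amount > 100
plan = Project(["name"], Filter(lambda r: r["amount"] > 100, Scan()))

# The same plan over a bounded (batch) source...
batch = [{"name": "a", "amount": 50}, {"name": "b", "amount": 200}]
print(list(plan.execute(batch)))  # -> [{'name': 'b'}]

# ...and over an unbounded (streaming) source, consumed lazily.
def unbounded():
    yield {"name": "c", "amount": 300}
    yield {"name": "d", "amount": 10}

stream_result = plan.execute(unbounded())
print(next(stream_result))        # -> {'name': 'c'}
```

Because every operator is written against "an iterable of rows", the batch/streaming distinction lives entirely in the source, which is the unification the talk describes at a much larger scale.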
Apache Flink 1.0: A New Era for Real-World Streaming Analytics Slim Baltagi
These are the slides of my talk at the Chicago Apache Flink Meetup on April 19, 2016. This talk explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of real-time and real-world streaming analytics. The talk maps Flink's capabilities to streaming analytics use cases.
In this presentation Guido Schmutz talks about Apache Kafka, Kafka Core, Kafka Connect, Kafka Streams, Kafka and "Big Data"/"Fast Data" ecosystems, the Confluent Data Platform and Kafka in architecture.
Why Apache Flink is the 4G of Big Data Analytics Frameworks Slim Baltagi
Apache Flink is a community-driven open source and memory-centric Big Data analytics framework. It provides the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases.
Flink uses a mixture of Scala and Java internally, has very good Scala APIs and some of its libraries are basically pure Scala (FlinkML and Table).
At its core, it is a streaming dataflow execution engine and it also provides several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML) and graph processing (Gelly).
In this talk, you will learn in more detail about:
- What Apache Flink is, how it fits into the Big Data ecosystem, and why it is the 4G (4th generation) of Big Data analytics frameworks
- How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment
- Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark, and what the benchmarking results are between Apache Flink and those other Big Data analytics frameworks
dA Platform is a production-ready platform for stream processing with Apache Flink®. The Platform includes open source Apache Flink, a stateful stream processing and event-driven application framework, and dA Application Manager, a central deployment and management component. dA Platform schedules clusters on Kubernetes, deploys stateful Flink applications, and controls these applications and their state.
Stratosphere System Overview, Big Data Beers Berlin, 20.11.2013 Robert Metzger
Stratosphere is the next generation big data processing engine.
These slides introduce the most important features of Stratosphere by comparing it with Apache Hadoop.
For more information, visit stratosphere.eu
Based on university research, it is now a completely open-source, community driven development with focus on stability and usability.
Stratosphere Intro (Java and Scala Interface) Robert Metzger
A quick overview of Stratosphere, including our Scala programming interface.
See also bigdataclass.org for two self-paced Stratosphere Big Data exercises.
More information about Stratosphere: stratosphere.eu
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round-table discussion of vector databases, unstructured data, AI, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
Analysis insights about a Flyball dog competition team's performance roli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
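For reference, here is a compact sketch of the standard ("monolithic") PageRank iteration the report uses as its baseline, with dead-end mass redistributed uniformly so no rank is lost. Note this is the common teleport-style strategy; the "loop-based" handling in the title instead treats dead ends as self-loops. Graph and parameters are toy values, not the report's benchmark setup.

```python
def pagerank(out_links, damping=0.85, iters=50):
    """Power iteration over a dict {vertex: [out-neighbors]}."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Rank mass sitting on dead ends (no out-links) is spread over all vertices,
        # which is what keeps total rank mass conserved at 1.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        nxt = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for u in out_links[v]:
                nxt[u] += damping * rank[v] / len(out_links[v])
        rank = nxt
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}  # "d" is a dead end
ranks = pagerank(graph)
print({v: round(r, 3) for v, r in sorted(ranks.items())})
print(round(sum(ranks.values()), 6))  # mass is conserved: sums to 1.0
```

The levelwise variant in the report runs essentially this update, but restricted to one strongly connected component level at a time in topological order, which is why dead ends must be handled up front.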
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
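The retrieval step of a RAG pipeline like the one described above can be sketched minimally: documents are embedded, the query is embedded the same way, and the top-scoring documents are stitched into the LLM prompt. Here a bag-of-words vector stands in for a real embedding model, and all names are illustrative, not the talk's actual architecture.

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    scored = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]

docs = [
    "orders table schema: order_id, customer_id, total",
    "employees table schema: employee_id, name, salary",
]
context = retrieve("total value of customer orders", docs)
prompt = f"Answer using this context:\n{context[0]}\nQuestion: ..."
print(context[0])  # -> the orders-table document
```

In a production SQL copilot, the bag-of-words stand-in would be replaced by a learned embedding model and a vector index, but the augment-prompt-with-retrieved-context shape stays the same.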
2. Flink Community Updates
• What happened in the Flink community? Check out the monthly newsletter on the blog.
• Subscribe to news@flink.apache.org
flink.apache.org
3. What happened?
• Community decided to release Flink 0.9-milestone1 next week; Flink 0.9 will come a few weeks afterwards
• Flink runner for the Google Dataflow API available
• Focus on streaming stability (YARN container restart, Kafka source checkpointing)
4. Now in master (0.9-SNAPSHOT)
• Expression API renamed to Table API
• Java support for Table API
5. Now in master: Flink Machine Learning Library
• Merged:
– ALS (Recommendations)
– Linear Regression & Multiple Linear Regression
– Utilities, basic data types (sparse vectors & matrix)
• Overview of open issues: https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK
6. Flink on the Web
• Blogpost: Peeking into Apache Flink's Engine Room [1]
• Naive Bayes on Apache Flink [2]
• Announcing Google Cloud Dataflow runner for Apache Flink [3][4][5]
• How to factorize a 700 GB matrix with Apache Flink [6]
[1] http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
[2] http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html
[3] http://googlecloudplatform.blogspot.de/2015/03/announcing-Google-Cloud-Dataflow-runner-for-Apache-Flink.html
[4] http://www.data-artisans.com/dataflow.html
[5] http://www.heise.de/developer/meldung/Big-Data-Google-Cloud-Dataflow-bekommt-Runner-fuer-Apache-Flink-2583392.html
[6] http://www.data-artisans.com/als.html
7. New Wiki pages with system internals
• Data Exchange between tasks
• Type Extraction and Serialization
• Memory Management (Batch API)
• Akka and Actors