Presented by Ibrahim Muhammadi. Founder - AppWorx.cc
Big Data is revolutionizing how businesses make decisions. More and more decisions and strategies are now based on data.
PowerStream: Propelling Energy Innovation with Predictive Analytics - SingleStore
This document discusses a presentation about MemSQL PowerStream, a product for predicting the global health of wind turbines. The presentation covers renewable energy news stories, introduces PowerStream, demonstrates high-speed data ingestion and predictive analytics using MemSQL and Spark, and shows how SQL queries can be pushed down to MemSQL for faster processing. It concludes with a question and answer section.
JanusGraph: Looking Backward, Reaching Forward - Demai Ni
JanusGraph: Looking Backward and Reaching Forward - by Jason Plurad (@pluradj):
The JanusGraph project started at the Linux Foundation earlier this year, but it is not the new kid on the block. We'll start with a look at the origins and evolution of this open source graph database through the lens of a few IBM graph use cases. We'll discuss the new features in the latest release of JanusGraph, and then take a look at future directions to explore together with the open community.
How to Create the Google for Earth Data (XLDB 2015, Stanford) - Rainer Sternfeld
Rainer Sternfeld presented on creating a Google-like platform for earth data using Planet OS. He described the challenges NOAA faces in managing tens of terabytes of weather data per day across scattered systems. Planet OS could index NOAA's metadata and downsample remote datasets via APIs. It would store chunked array data in object stores like S3 and provide on-demand computing via cloud services. This would make NOAA's large-scale data easily discoverable and machine-readable while addressing issues like data volume, transport, and real-time dissemination.
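The downsampling step in this summary can be sketched in a few lines; the chunk-averaging below is a hypothetical stand-in for whatever reduction Planet OS actually applies server-side, with invented readings:

```python
from statistics import mean

def downsample(series, factor):
    """Downsample a time series by averaging fixed-size chunks.

    A toy stand-in for the kind of server-side reduction applied
    before shipping a remote dataset over an API.
    """
    return [mean(series[i:i + factor]) for i in range(0, len(series), factor)]

# A day of hourly readings reduced to six four-hour averages.
hourly = [10, 12, 11, 13, 20, 22, 21, 23, 5, 7, 6, 8,
          10, 12, 11, 13, 20, 22, 21, 23, 5, 7, 6, 8]
print(downsample(hourly, 4))  # [11.5, 21.5, 6.5, 11.5, 21.5, 6.5]
```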
The Impact of Always-on Connectivity for Geospatial Applications and Analysis - SingleStore
This document summarizes MemSQL's geospatial database capabilities and features. It discusses how always-on connectivity and ubiquitous mobile devices have enabled new transportation-based geospatial applications using real-time taxi and rideshare data. It provides examples using New York taxi data, including demonstrations of MemSQL's Supercar and Zoomdata's TaxiStats applications that perform real-time geospatial analytics on streaming transportation data.
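The core of a radius query over streaming taxi positions is a great-circle distance test; a minimal stdlib sketch (the fleet data is made up, not from the talk):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def taxis_within(taxis, lat, lon, km):
    """Filter (taxi_id, lat, lon) records to those within `km` of a point."""
    return [t for t in taxis if haversine_km(lat, lon, t[1], t[2]) <= km]

# Toy query: taxis within 1 km of Times Square (40.7580 N, 73.9855 W).
fleet = [("cab1", 40.7580, -73.9855),
         ("cab2", 40.7128, -74.0060),   # downtown, several km away
         ("cab3", 40.7614, -73.9776)]
print([t[0] for t in taxis_within(fleet, 40.7580, -73.9855, 1.0)])  # ['cab1', 'cab3']
```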
The Critical Role of IoT Data Integration to develop Big Data Applications (f... - Rainer Sternfeld
HP predicts that by 2020, 40% of all data ever collected by humankind will have been generated by sensors. But if you can't use the data, if you can't search and discover it, and if you can't make it machine-readable, then the investment in intelligent sensor networks will go unused.
In this presentation, I discuss different cases of data integration and discovery, and how to turn this data into usable, readable information both for humans and machines, thus allowing data professionals, executives and data vendors all to do what they do best, leaving data integration and discovery to professionals.
Airline Reservations and Routing: A Graph Use Case - Jason Plurad
We've all been there before... you hear the announcement that your flight is canceled. Fellow passengers race to the gate agent to rebook on the next available flight. How do they quickly determine the best route from Berlin to San Francisco? Ultimately, the flight route network is best solved as a graph problem. We will discuss our lessons learned from working with a major airline to solve this problem using the JanusGraph database. JanusGraph is an open source graph database designed for massive scale. It is compatible with several pieces of the open source big data stack: Apache TinkerPop (graph computing framework), HBase, Cassandra, and Solr. We will go into depth about our approach to benchmarking graph performance and discuss the utilities we developed. We will share our comparison results for evaluating which storage backend to use with JanusGraph. Whether you are productizing a new database or you are a frustrated traveler, a fast resolution is needed to satisfy everybody involved. Presented at DataWorks Summit Berlin on April 18, 2018
Presented at the Linked Data Benchmark Council (LDBC) Technical User Group (TUG) Meeting on June 8, 2018. http://www.ldbcouncil.org/blog/11th-tuc-meeting-university-texas-austin
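The rebooking problem above is classic shortest-path; a toy sketch using Dijkstra's algorithm over a made-up flight network (a real system would run a graph traversal against JanusGraph instead):

```python
import heapq

def cheapest_route(flights, src, dst):
    """Dijkstra over a flight network; flights maps airport -> [(neighbor, hours)].

    A toy stand-in for the traversal a graph-backed rebooking engine runs;
    the airports and flight times below are invented.
    """
    heap, seen = [(0, src, [src])], set()
    while heap:
        cost, city, path = heapq.heappop(heap)
        if city == dst:
            return cost, path
        if city in seen:
            continue
        seen.add(city)
        for nxt, hours in flights.get(city, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + hours, nxt, path + [nxt]))
    return None  # no route exists

network = {
    "BER": [("LHR", 2), ("JFK", 9)],
    "LHR": [("SFO", 11), ("JFK", 8)],
    "JFK": [("SFO", 6)],
}
print(cheapest_route(network, "BER", "SFO"))  # (13, ['BER', 'LHR', 'SFO'])
```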
Djangocon Europe 2017: Planet Friendly Django - Chris Adams
This document summarizes a presentation by Chris Adams on planet friendly web development with Django. Adams discusses several of his own projects focused on reducing carbon emissions, including APIs to calculate CO2 and apps for low-carbon travel and tea. He then outlines ways web developers can build more sustainably, such as optimizing server usage, using serverless architectures, reducing network packet waste, and deploying content compression techniques. The talk aims to raise awareness of technology's environmental impact and strategies for mitigating it.
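Of the techniques listed, content compression is the easiest to demonstrate; a small sketch of the bandwidth gzip saves on a repetitive payload (the payload itself is invented):

```python
import gzip

# A repetitive HTML-ish payload, standing in for a typical page response.
payload = b"<div class='row'><p>hello world</p></div>" * 200
compressed = gzip.compress(payload)

# Repetitive markup compresses extremely well, which is why serving
# compressed responses cuts network transfer (and its energy cost).
print(f"{len(payload)} -> {len(compressed)} bytes "
      f"({len(compressed) / len(payload):.1%} of original)")
```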
Exploring Graph Use Cases with JanusGraph - Jason Plurad
Graph databases are relative newcomers in the NoSQL database landscape. What are some graph model and design considerations when choosing a graph database in your architecture? Let's take a tour of a couple graph use cases that we've collaborated on recently with our clients to help you better understand how and why a graph database can be integrated to help solve problems found with connected data. Presented at DataWorks Summit San Jose - IBM Meetup on June 18, 2018.
https://www.meetup.com/BigDataDevelopers/events/251307524/
Predicting Loan Delinquency at One Million Transactions per Second - Revolution Analytics
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
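Scoring a boosted-tree ensemble is just a sum of per-tree lookups, which is why it can be pushed close to the data at high throughput; a toy sketch with hand-written stumps (not the talk's actual model, and the features are invented):

```python
import math

# Hand-written stumps standing in for a trained gradient-boosted model:
# (feature, threshold, value_if_below, value_if_at_or_above). A real model
# sums hundreds of deeper trees, but the scoring loop has the same shape.
STUMPS = [
    ("debt_ratio", 0.4, -0.2, 0.7),
    ("late_payments", 2, -0.1, 0.5),
    ("income", 50_000, 0.3, -0.3),
]

def score(record):
    """Sum stump outputs and squash to a delinquency probability."""
    raw = sum(above if record[f] >= t else below
              for f, t, below, above in STUMPS)
    return 1 / (1 + math.exp(-raw))

risky = {"debt_ratio": 0.8, "late_payments": 4, "income": 30_000}
safe = {"debt_ratio": 0.1, "late_payments": 0, "income": 90_000}
print(score(risky) > score(safe))  # True
```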
One of the first problems a developer encounters when evaluating a graph database is how to construct a graph efficiently. Recognizing this need in 2014, TinkerPop's Stephen Mallette penned a series of blog posts titled "Powers of Ten" which addressed several bulkload techniques for Titan. Since then Titan has gone away, and the open source graph database landscape has evolved significantly. Do the same approaches stand the test of time? In this session, we will take a deep dive into strategies for loading data of various sizes into modern Apache TinkerPop graph systems. We will discuss bulkloading with JanusGraph, the scalable graph database forked from Titan, to better understand how its architecture can be optimized for ingestion. Presented at Data Day Texas on January 27, 2018.
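Most bulkload recipes come down to batching mutations between commits: small enough batches to keep transactions cheap, large enough to amortize round trips. A minimal batching helper (the JanusGraph calls mentioned in the comments are illustrative, not executed here):

```python
from itertools import islice

def batches(records, size):
    """Yield fixed-size lists from any iterable of vertex records.

    Committing once per batch, rather than per element or all at once,
    is the core of most TinkerPop bulkload recipes.
    """
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk

vertices = ({"name": f"v{i}"} for i in range(10))
for batch in batches(vertices, 4):
    # In a real loader this is where you would issue g.addV(...) per
    # record and commit the transaction once per batch.
    print(len(batch))  # prints 4, then 4, then 2
```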
Presented at Open Camps (Database Camp) in New York City on November 19, 2017. http://www.db.camp/2017/presentations/graph-computing-with-apache-tinkerpop
The JanusGraph project started at the Linux Foundation earlier this year, but it is not the new kid on the block. We'll start with a look at the origins and evolution of this open source graph database through the lens of a few IBM graph use cases. We'll discuss the new features in the latest release of JanusGraph, and then take a look at future directions to explore together with the open community. Presented on October 18, 2017 at the Graph Technologies Meetup in Santa Clara, CA. https://www.meetup.com/_CAIDI/events/243122187/
Indexing the Real World Sensor Networks (at RE.WORK Internet of Things Summit... - Rainer Sternfeld
This talk focuses on how harnessing sensor data intelligently (proprietary, commercial and public) enables building better applications, what the operational challenges of oil spill response are, and what kinds of sensor networks are being utilized in weather forecasting, environmental monitoring and beyond.
Planet OS is a software platform for real-world sensor data integration, designed for ocean, land, air and space-based applications. Planet OS has developed a powerful suite that combines data mining, integration, search, visualization, analytics and secure data exchange between parties. It offers a single interface to work with all your proprietary (local and remote), commercial or open data.
The document presents 12 facts about flash storage and its advantages over disk storage. Flash storage capacity is projected to grow to be 1000x more than disk storage by 2026. Flash storage is also more reliable than disk storage and flash memory costs have been decreasing rapidly. Flash storage density has been increasing 2-4x every 2 years, resulting in widespread adoption.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Building Robust Production Data Pipelines with Databricks Delta - Databricks
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark APIs. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be an instructor-led, hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the key data reliability and performance data pipelines challenges
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment
– How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speakers: Steven Yu, Burak Yavuz
Zillow's favorite big data & machine learning tools - njstevens
This talk covers Zillow's favorite tools for keeping track of research, cluster computing, machine learning open source, workflow management, logging, deep learning and data storage
Driving the On-Demand Economy with Predictive Analytics - SingleStore
Nikita Shamgunov, CTO and Co-founder of MemSQL, discusses how MemSQL enables real-time predictive analytics through its in-memory and scale-out database. MemSQL allows data from hundreds of thousands of machines to be analyzed and delivers value through real-time code deployment, anomaly detection, and A/B testing results. MemSQL is a scalable, elastic, and real-time data warehouse that can be deployed on-premises, as a managed cloud service, or in multi-cloud environments.
The document discusses scalable machine learning techniques for analyzing large datasets. It explains that while parts of the machine learning pipeline like data preparation are easily parallelizable, training steps involving gradient descent are more difficult to parallelize. However, there are approaches for scalable training such as stochastic gradient descent, parameter servers, and feature hashing that approximate the model to make distributed optimization feasible. The key aspects of scalable machine learning involve faster learning algorithms, approximating the optimization problem and features, and asynchronous distributed techniques rather than just relying on parallelization alone.
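Two of the approximations named here, feature hashing and stochastic gradient descent, fit in a short stdlib sketch (the dimensionality and training data are toy choices):

```python
import hashlib

DIM = 16  # fixed hashed feature space; collisions are the accepted approximation

def hash_features(tokens):
    """Map raw tokens into a fixed-width vector via the hashing trick."""
    vec = [0.0] * DIM
    for tok in tokens:
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % DIM] += 1.0
    return vec

def sgd_step(weights, x, y, lr=0.1):
    """One stochastic gradient step for a linear model under squared loss."""
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [w - lr * err * xi for w, xi in zip(weights, x)]

# Repeatedly stepping toward a single labeled example drives the
# model's prediction for that example toward its target of 1.0.
w = [0.0] * DIM
for _ in range(100):
    w = sgd_step(w, hash_features(["spam", "offer"]), 1.0)

pred = sum(wi * xi for wi, xi in zip(w, hash_features(["spam", "offer"])))
print(round(pred, 3))  # 1.0
```

Because updates touch only the hashed buckets of each example, many such steps can run asynchronously across workers, which is the distributed angle the summary refers to.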
Designing a Better Planet with Big Data and Sensor Networks (for Intelligent ... - Rainer Sternfeld
Planet OS is a data discovery engine designed for real world sensor data. One interface to access your local, remote, open and vendor data.
This presentation answers questions like:
• How is the growth of sensor data challenging traditional data management, storage and usability of it?
• What are the trends in machine data and how will sensor data change Big Data over the next decade?
• How many devices are there on the Internet today? What will happen to this map in 10 years?
• What is the sensor data value chain, what gives you competitive advantage over others?
• Why is sensor data hard?
• Examples and use cases of the markets utilizing the latest robotic and mobile sensing platforms on land (energy production, agriculture, connected cars, weather forecasting), in the ocean (oil & gas, marine acoustics, shipping, environmental monitoring), air (drones) and space (nanosatellites, data-driven weather forecasting).
• How is Planet OS solving these challenges with its Data Discovery Engine and its mission to index the real world? What data types do we work with? What are the applications, and how does having a single interface and a single index help organizations increase the ROI of operations, emergency response and planning?
• The Industrial Internet (GE), The Internet of Everything (Cisco)
• Why do Big Data clouds need trust management for secure operations over open networks? (Intertrust)
Building an IoT Kafka Pipeline in Under 5 Minutes - SingleStore
This document discusses building an IoT Kafka pipeline using MemSQL in under 5 minutes. It begins with an overview of IoT, Kafka, and operational data warehouses. It then discusses MemSQL and how it functions as an operational data warehouse by continuously loading and querying data in real-time. The document demonstrates launching a MemSQL cluster, creating schemas and pipelines to ingest, transform, persist and analyze IoT data from Kafka. It emphasizes MemSQL's ability to handle the variety and scale of IoT data at high throughput with low latency.
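Conceptually, a pipeline like the one demonstrated wires a source to a transform to an upserting sink; a plain-Python simulation of that flow (the sensor schema and readings are invented, and no Kafka or MemSQL is involved):

```python
import json

def run_pipeline(messages, table):
    """Consume Kafka-style JSON messages, transform them, and upsert by key.

    A stdlib stand-in for what an ingest pipeline wires together:
    source (messages), transform (parse + derive), sink (table).
    """
    for raw in messages:
        event = json.loads(raw)                          # ingest
        event["temp_f"] = event["temp_c"] * 9 / 5 + 32   # transform
        table[event["sensor_id"]] = event                # upsert: last write wins

readings = [
    '{"sensor_id": "s1", "temp_c": 20.0}',
    '{"sensor_id": "s2", "temp_c": 30.0}',
    '{"sensor_id": "s1", "temp_c": 25.0}',  # newer reading replaces the first
]
table = {}
run_pipeline(readings, table)
print(table["s1"]["temp_f"])  # 77.0
```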
The document discusses how data from various sources was integrated into the HuisKluis app. Originally, all data was stored on the server and pages were rendered server-side. Modern approaches split data and applications, using APIs to retrieve targeted data as needed. Linked data and SPARQL queries help guide clients to additional related data across different APIs through standardized URIs. For data providers, it is recommended to provide APIs instead of just data dumps, and follow W3C standards for publishing data on the web to make it easier to use and link externally.
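The link-following behavior described can be sketched with dicts standing in for API responses (the URIs and resource types are hypothetical):

```python
# Toy "web" of API responses keyed by URI, each linking onward to related
# resources -- the linked-data pattern the summary describes, with dicts
# standing in for real HTTP endpoints.
WEB = {
    "https://api.example.org/house/42": {
        "type": "House", "links": ["https://api.example.org/permit/7"],
    },
    "https://api.example.org/permit/7": {
        "type": "Permit", "links": ["https://api.example.org/zone/3"],
    },
    "https://api.example.org/zone/3": {"type": "Zone", "links": []},
}

def crawl(start):
    """Follow links breadth-first, collecting every reachable resource type."""
    queue, seen, types = [start], set(), []
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        doc = WEB[uri]
        types.append(doc["type"])
        queue.extend(doc["links"])
    return types

print(crawl("https://api.example.org/house/42"))  # ['House', 'Permit', 'Zone']
```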
The document provides 10 facts about cloud storage to prepare attendees for the NetApp Insight conference in October and November. Some key facts include that 80% of companies see business benefits within 6 months of adopting cloud technologies, 90% of enterprises have implemented a cloud strategy, and global data center traffic is expected to triple from 2012 to 2017. The conferences will provide over 300 technical sessions on building data fabrics across flash, disk and cloud storage.
Credit Fraud Prevention with Spark and Graph Analysis - Jen Aman
This document discusses using Spark and graph analysis to prevent credit card fraud in real-time. It describes how fraud costs billions annually and affects millions of people. Common fraud types are outlined. The solution involves combining multiple data sources using Spark and a graph database to score applications for fraud in real-time. A demo is shown using sample fraudulent data and a fraud prediction model. Performance metrics are provided for the Databricks and Visallo platforms used to ingest data and detect fraud.
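One simple graph signal for application fraud is shared identifying attributes; a stdlib sketch that links applications sharing a phone or address (the sample data is invented):

```python
from collections import defaultdict
from itertools import combinations

def suspicious_pairs(applications):
    """Link applications that share identifying attributes.

    In a full system these edges live in a graph database and feed a
    fraud score; here we just surface pairs sharing any attribute.
    """
    by_attr = defaultdict(set)
    for app_id, attrs in applications.items():
        for attr in attrs:
            by_attr[attr].add(app_id)
    pairs = set()
    for ids in by_attr.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)

apps = {
    "A1": {"phone:555-0100", "addr:12 Elm St"},
    "A2": {"phone:555-0100", "addr:99 Oak Ave"},  # shares a phone with A1
    "A3": {"phone:555-0199", "addr:7 Pine Rd"},
}
print(suspicious_pairs(apps))  # [('A1', 'A2')]
```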
APIs and Micro-services - 7 modern trends every IT professional should know a... - Ibrahim Muhammadi
This is part 1 in a series of presentations I will be doing. This series is about 7 ground breaking changes that have happened in the IT world - changes that will affect how IT professionals will develop and deploy applications.
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben... - confluent
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC-first architecture.
Things we’ll discuss about decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end to end walk through of our processes that allow non-programmers to write and deploy event-driven data flows.
– Show end to end the usage of dynamic event processing that creates other stream processes, via a dynamic control plane topology pattern and broadcasted state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
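The decoupling argument running through these points can be shown with a minimal in-process event bus (a stand-in for Kafka, with invented topics and events):

```python
from collections import defaultdict

class EventBus:
    """A minimal in-process stand-in for a Kafka-style event bus.

    Producers publish to topics without knowing who consumes them,
    which is the decoupling the RPC-vs-events discussion is about.
    """
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
matches, analytics = [], []
# Two independent consumers of the same topic; neither knows the other exists.
bus.subscribe("swipes", lambda e: matches.append(e) if e["right"] else None)
bus.subscribe("swipes", analytics.append)

bus.publish("swipes", {"user": "u1", "right": True})
bus.publish("swipes", {"user": "u2", "right": False})
print(len(matches), len(analytics))  # 1 2
```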
Exploring Graph Use Cases with JanusGraphJason Plurad
Graph databases are relative newcomers in the NoSQL database landscape. What are some graph model and design considerations when choosing a graph database in your architecture? Let's take a tour of a couple graph use cases that we've collaborated on recently with our clients to help you better understand how and why a graph database can be integrated to help solve problems found with connected data. Presented at DataWorks Summit San Jose - IBM Meetup on June 18, 2018.
https://www.meetup.com/BigDataDevelopers/events/251307524/
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
One of the first problems a developer encounters when evaluating a graph database is how to construct a graph efficiently. Recognizing this need in 2014, TinkerPop's Stephen Mallette penned a series of blog posts titled "Powers of Ten" which addressed several bulkload techniques for Titan. Since then Titan has gone away, and the open source graph database landscape has evolved significantly. Do the same approaches stand the test of time? In this session, we will take a deep dive into strategies for loading data of various sizes into modern Apache TinkerPop graph systems. We will discuss bulkloading with JanusGraph, the scalable graph database forked from Titan, to better understand how its architecture can be optimized for ingestion. Presented at Data Day Texas on January 27, 2018.
Presented at Open Camps (Database Camp) in New York City on November 19, 2017. http://www.db.camp/2017/presentations/graph-computing-with-apache-tinkerpop
The JanusGraph project started at the Linux Foundation earlier this year, but it is not the new kid on the block. We'll start with a look at the origins and evolution of this open source graph database through the lens of a few IBM graph use cases. We'll discuss the new features in latest release of JanusGraph, and then take a look at future directions to explore together with the open community. Presented on October 18, 2017 at the Graph Technologies Meetup in Santa Clara, CA. https://www.meetup.com/_CAIDI/events/243122187/
Indexing the Real World Sensor Networks (at RE.WORK Internet of Things Summit...Rainer Sternfeld
This talk focuses on how harnessing sensor data intelligently (proprietary, commercial and public) enables to build better applications, what are the operational challenges of oil spill responses, and what kind of sensor networks are being utilized in weather forecasting, environmental monitoring and beyond.
Planet OS is a software platform for real-world sensor data integration, designed for ocean, land, air and space-based applications. Planet OS has developed a powerful suite that combines data mining, integration, search, visualization, analytics and secure data exchange between parties. It offers a single interface to work with all your proprietary (local and remote), commercial or open data.
The document presents 12 facts about flash storage and its advantages over disk storage. Flash storage capacity is projected to grow to be 1000x more than disk storage by 2026. Flash storage is also more reliable than disk storage and flash memory costs have been decreasing rapidly. Flash storage density has been increasing 2-4x every 2 years, resulting in widespread adoption.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark API’s. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingests, fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class. WHAT
YOU’LL LEARN:
– Understand the key data reliability and performance data pipelines challenges
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment – How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speakers: Steven Yu, Burak Yavuz
Zillow's favorite big data & machine learning toolsnjstevens
This talk covers Zillow's favorite tools for keeping track of research, cluster computing, machine learning open source, workflow management, logging, deep learning and data storage
Driving the On-Demand Economy with Predictive AnalyticsSingleStore
Nikita Shamgunov, CTO and Co-founder of MemSQL, discusses how MemSQL enables real-time predictive analytics through its in-memory and scale-out database. MemSQL allows data from hundreds of thousands of machines to be analyzed and delivers value through real-time code deployment, anomaly detection, and A/B testing results. MemSQL is a scalable, elastic, and real-time data warehouse that can be deployed on-premises, as a managed cloud service, or in multi-cloud environments.
The document discusses scalable machine learning techniques for analyzing large datasets. It explains that while parts of the machine learning pipeline like data preparation are easily parallelizable, training steps involving gradient descent are more difficult to parallelize. However, there are approaches for scalable training such as stochastic gradient descent, parameter servers, and feature hashing that approximate the model to make distributed optimization feasible. The key aspects of scalable machine learning involve faster learning algorithms, approximating the optimization problem and features, and asynchronous distributed techniques rather than just relying on parallelization alone.
Designing a Better Planet with Big Data and Sensor Networks (for Intelligent ...Rainer Sternfeld
Planet OS is a data discovery engine designed for real world sensor data. One interface to access your local, remote, open and vendor data.
This presentation answers questions like:
• How is the growth of sensor data challenging traditional data management, storage and usability?
• What are the trends in machine data and how will sensor data change Big Data over the next decade?
• How many devices are there on the Internet today? What will happen to this map in 10 years?
• What is the sensor data value chain, what gives you competitive advantage over others?
• Why is sensor data hard?
• Examples and use cases of the markets utilizing the latest robotic and mobile sensing platforms on land (energy production, agriculture, connected cars, weather forecasting), in the ocean (oil & gas, marine acoustics, shipping, environmental monitoring), air (drones) and space (nanosatellites, data-driven weather forecasting).
• How Planet OS is solving these challenges with its Data Discovery Engine and a mission to index the real world. What are the data types we work with? What are the applications, and how does having a single interface and a single index help organizations increase the ROI of their operations, emergency response and planning?
• The Industrial Internet (GE), The Internet of Everything (Cisco)
• Why do Big Data clouds need trust management for secure operations over open networks? (Intertrust)
Building an IoT Kafka Pipeline in Under 5 Minutes (SingleStore)
This document discusses building an IoT Kafka pipeline using MemSQL in under 5 minutes. It begins with an overview of IoT, Kafka, and operational data warehouses. It then discusses MemSQL and how it functions as an operational data warehouse by continuously loading and querying data in real-time. The document demonstrates launching a MemSQL cluster, creating schemas and pipelines to ingest, transform, persist and analyze IoT data from Kafka. It emphasizes MemSQL's ability to handle different data types and scales from IoT at high throughput with low latency.
The document discusses how data from various sources was integrated into the HuisKluis app. Originally, all data was stored on the server and pages were rendered server-side. Modern approaches split data and applications, using APIs to retrieve targeted data as needed. Linked data and SPARQL queries help guide clients to additional related data across different APIs through standardized URIs. For data providers, it is recommended to provide APIs instead of just data dumps, and follow W3C standards for publishing data on the web to make it easier to use and link externally.
The document provides 10 facts about cloud storage to prepare attendees for the NetApp Insight conference in October and November. Some key facts include that 80% of companies see business benefits within 6 months of adopting cloud technologies, 90% of enterprises have implemented a cloud strategy, and global data center traffic is expected to triple from 2012 to 2017. The conferences will provide over 300 technical sessions on building data fabrics across flash, disk and cloud storage.
Credit Fraud Prevention with Spark and Graph Analysis (Jen Aman)
This document discusses using Spark and graph analysis to prevent credit card fraud in real-time. It describes how fraud costs billions annually and affects millions of people. Common fraud types are outlined. The solution involves combining multiple data sources using Spark and a graph database to score applications for fraud in real-time. A demo is shown using sample fraudulent data and a fraud prediction model. Performance metrics are provided for the Databricks and Visallo platforms used to ingest data and detect fraud.
APIs and Micro-services - 7 modern trends every IT professional should know a... (Ibrahim Muhammadi)
This is part 1 in a series of presentations I will be doing. This series is about 7 ground breaking changes that have happened in the IT world - changes that will affect how IT professionals will develop and deploy applications.
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben... (confluent)
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC first architecture.
Topics we'll discuss on decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end to end walk through of our processes that allow non-programmers to write and deploy event-driven data flows.
– Show end to end the usage of dynamic event processing that creates other stream processes, via a dynamic control plane topology pattern and broadcasted state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli... (Databricks)
We will present the design and evolution of Nvidia's 100% self-service streaming big-data platform (ETL, analytics, AI training & inferencing) powered by Spark and Nvidia GPUs. We will discuss the architecture, major challenges that we faced, and lessons learned along the way. Nvidia's data platform processes tens of billions of events per day, supporting several Nvidia products like GPU Cloud, GeForce NOW Cloud Gaming, AI Smart Cities, DriveSim for self-driving cars, etc. In this talk, we are going to deep dive into Nvidia's next-generation data platform with new custom-built frameworks, automation tools, and a monitoring system on top of Spark, empowering our developers to build new Spark-powered applications at the speed of light (SOL) with fully self-service, unified data flows. We will showcase these new tools: a) zero-engineering dashboards, b) out-of-the-box Spark Streaming applications with automated schema management, c) a custom Spark Streaming to Elasticsearch connector with enhanced security, d) GDPR-compliant SQL access control and auditing with a new custom token management framework, e) migration from Logstash clusters to Spark Streaming for log parsing, etc. We will discuss how decoupling the data platform and applications helped us achieve the next level of scale, self-service, and security. Finally, we will demo our platform's App Store, where developers can shop for new apps and deploy them with ease - with automated dashboards, streaming ETL, analytics, monitoring, AI training and inferencing. Extended description: With structured telemetry events and unstructured logs growing at a 1000% rate year-over-year, it is extremely important to handle this scale with strict SLAs and high reliability while maintaining extremely low latency. We will discuss how we handled these scaling and security concerns to solve business requirements. Additionally, we will be open-sourcing some of our custom Spark frameworks during the talk.
Speakers: Satish Dandu, Rohit Kulkarni
apidays LIVE Helsinki & North 2022_Apps without APIs (apidays)
apidays LIVE Helsinki & North: API Ecosystems - Connecting Physical and Digital
March 16 & 17, 2022
Apps without APIs - Leveraging the stack that we all use, but never think about
Sampo Savolainen, CTO at Spatineo
ALT-F1.BE : The Accelerator (Google Cloud Platform) (Abdelkrim Boujraf)
The Accelerator is an IT infrastructure able to collect and analyze a massive amount of public data on the WWW.
The Accelerator leverages the untapped potential of web data with the first solution designed for diverse sectors,
completely scalable, available on-premise, and cloud-provider agnostic.
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022 (HostedbyConfluent)
Event-first thinking and streaming help organizations transition from followers to leaders in the market. A reliable, scalable, and economical streaming architecture helps them get there.
This talk first explores the "classic streaming stack," based on the Lambda architecture, its origin, and why it didn't pick up amongst data-driven organizations. The modern streaming stack (MSS) is a lean, cloud-native, and economical alternative to classic streaming architectures, where it aims to make event-driven real-time applications viable for organizations.
The second half of the talk explores the MSS in detail, including its core components, their purposes, and how Kappa architecture has influenced it. Moreover, the talk lays out a few considerations before planning a new streaming application within an organization. The talk concludes by discussing the challenges in the streaming world and how vendors are trying to overcome them in the future.
This document summarizes how information technology (IT) infrastructure and operations have changed from expensive and slow on-premise systems to cheaper and faster cloud-based systems. It notes that IT used to require renting and maintaining thousands of servers, but now services allow provisioning servers quickly and returning them just as fast. Where systems used to support small numbers of users, they now must scale to massive "web scale." Tool usage has proliferated from using just a few tools to manage dependencies to using many different monitoring and analytics tools. Delivery cycles have accelerated from biannual releases to continuous delivery. It promotes a next-generation monitoring solution to help development and operations teams address these modern cloud-era challenges through data aggregation, correlation, collaboration
Serverless Architecture in application development - 7 modern trends every IT... (Ibrahim Muhammadi)
This is part 3 of 7 in the series "7 modern trends every IT pro should know about"
This part introduces the concept of serverless architecture in the cloud.
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools... (Artefactual Systems - AtoM)
These slides accompanied a June 4th, 2016 presentation made by Dan Gillean of Artefactual Systems at the Association of Canadian Archivists' 2016 Conference in Montreal, QC, Canada.
This presentation aims to examine several existing or emerging computing paradigms, with specific examples, to imagine how they might inform next-generation archival systems to support digital preservation, description, and access. Topics covered include:
- Distributed Version Control and git
- P2P architectures and the BitTorrent protocol
- Linked Open Data and RDF
- Blockchain technology
The session is part of an attempt by the ACA to create interactive "working sessions" at its conferences. Accompanying notes can be found at: http://bit.ly/tech-Proche
Participants were also asked to use the Twitter hashtag of #techProche for online interaction during the session.
Big data comes from various sources in different formats and structures. It includes social media data, search engine data, medical records, and IoT device data. YouTube alone generates billions of views from over 1 billion users uploading 1.2 million videos daily. Big data is characterized by its variety, velocity, and volume. Hadoop is an open source framework used to process and store big data. It has two core components - HDFS for storage and MapReduce for processing. Hadoop distributes data and tasks across nodes, provides fault tolerance, and scales dynamically with hardware.
This document summarizes the National Security Agency's (NSA) development and use of an internal private cloud using OpenStack for infrastructure as a service (IaaS). It describes how the NSA stood up an initial pilot OpenStack cloud in two weeks to host lab infrastructure. This grew into a production IaaS cloud hosting hundreds of users and mission systems. The cloud provides agility, flexibility and scalability while maintaining security, accountability and central management through APIs, logging, and metrics. The NSA contributes to open source projects like OpenStack and encourages more community participation and hiring.
IESL Talk Series: Apache System Projects in the Real World (Srinath Perera)
- LEAD was a large-scale e-science project funded by the NSF that used Apache technologies like Axis2, ODE, and others to build a dynamic weather analysis system across multiple universities in the US.
- It faced challenges at large scale including distributed resources, long-running jobs, large data, and usage spikes from many parallel users.
- Key subsystems included workflows, data, and messaging which also presented challenges around resource utilization, scalability, fault-tolerance, and security at that scale.
- Over time, LEAD transitioned major components to Apache projects and has now joined the Apache incubator as the Apache Airavata project.
Digital Business Transformation in the Streaming Era (Attunity)
Enterprises are rapidly adopting stream computing backbones, in-memory data stores, change data capture, and other low-latency approaches for end-to-end applications. As businesses modernize their data architectures over the next several years, they will begin to evolve toward all-streaming architectures. In this webcast, Wikibon, Attunity, and MemSQL will discuss how enterprise data professionals should migrate their legacy architectures in this direction. They will provide guidance for migrating data lakes, data warehouses, data governance, and transactional databases to support all-streaming architectures for complex cloud and edge applications. They will discuss how this new architecture will drive enterprise strategies for operationalizing artificial intelligence, mobile computing, the Internet of Things, and cloud-native microservices.
Link to the Wikibon report - wikibon.com/wikibons-2018-big-data-analytics-trends-forecast
Link to Attunity Streaming CDC Book Download - http://www.bit.ly/cdcbook
Link to MemSQL's Free Data Pipeline Book - http://go.memsql.com/oreilly-data-pipelines
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka (Kai Wähner)
If there were a buzzword of the hour, it would certainly be "data mesh"! This new architectural paradigm unlocks analytic data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios.
As such, the data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a data mesh infrastructure must be real-time, decoupled, reliable, and scalable.
This presentation explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and - complemented by many other data platforms like a data warehouse, data lake, and lakehouse - solve real business problems.
There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products; with the right tool for the job.
A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.
Data centers are growing to accommodate more internet-connected devices, with innovations helping achieve network coverage for billions of devices by 2020. As data centers grow, trends like software-driven infrastructure, microtechnology, and alternative energy use are making data centers more efficient by consolidating resources and reducing size. Hyperconvergence allows more efficient use of rack space by consolidating computer storage, networking, and virtualization in compact 2U systems from companies like Simplivity and Nutanix.
1st Birmingham Big Data Science Group meetup (Faizan Javed)
This document introduces the Birmingham Big Data Science Group (BIDS) and discusses big data and related technologies. It provides an overview of big data, large-scale distributed systems, NoSQL databases, and intelligent algorithms. Examples of prominent NoSQL database users and the Hadoop-based SMAQ stack are discussed. The document also covers next-generation systems beyond MapReduce/Hadoop and concludes that big data is a challenging and promising area.
Confluent hosted a technical thought leadership session to discuss how leading organisations move to real-time architecture to support business growth and enhance customer experience.
Event Streaming CTO Roundtable for Cloud-native Kafka Architectures (Kai Wähner)
Technical thought leadership presentation to discuss how leading organizations move to real-time architecture to support business growth and enhance customer experience. This is a forum to discuss use cases with your peers to understand how other digital-native companies are utilizing data in motion to drive competitive advantage.
Agenda:
- Data in Motion with Event Streaming and Apache Kafka
- Streaming ETL Pipelines
- IT Modernisation and Hybrid Multi-Cloud
- Customer Experience and Customer 360
- IoT and Big Data Processing
- Machine Learning and Analytics
Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.
Similar to Big Data - part 5/7 of "7 modern trends that every IT Pro should know about"
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Open Source Contributions to Postgres: The Basics POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... (Social Samosa)
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
3. With more and more digitalization, huge amounts of structured, semi-structured and unstructured data are being generated.
cc: phsymyst - https://www.flickr.com/photos/78624556@N08
4. In the early days of this explosive growth in digital data, businesses used to discard additional data because there was no feasible way to make any sense out of it.
cc: Kentrosaurus - https://www.flickr.com/photos/86125591@N00
5. But this is changing rapidly with advancements in the infrastructure needed for data storage and processing, collectively known as BIG DATA.
cc: Tom Raftery - https://www.flickr.com/photos/67945918@N00
6. The 3Vs of big data: extreme volume of data, wide variety of data types, and the velocity at which the data must be processed.
cc: dalbera - https://www.flickr.com/photos/72746018@N00
7. Such voluminous data can come from different sources, such as business sales records, the collected results of experiments, real-time sensors used in IoT, and more.
cc: bionicteaching - https://www.flickr.com/photos/29096601@N00
8. Adequate compute power is needed to achieve the desired velocity. This can potentially demand hundreds or thousands of servers that distribute the work and operate collaboratively.
cc: midom - https://www.flickr.com/photos/81295370@N00
9. In this short presentation we will look at some of the more popular tools that have made the Big Data revolution possible.
cc: Glenn Zucman - https://www.flickr.com/photos/18182611@N00
11. Distributed data storage and processing on consumer-grade hardware makes big data feasible. One open source project for this is Hadoop.
cc: NASA Goddard Photo and Video - https://www.flickr.com/photos/24662369@N07
12. Hadoop enables distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up to thousands of machines.
cc: solofotones - https://www.flickr.com/photos/14754973@N08
13. Rather than rely on hardware to deliver high availability, the Hadoop library is designed to detect and handle failures at the application layer, thereby delivering a highly available service.
cc: neil cummings - https://www.flickr.com/photos/23874985@N07
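Hadoop's programming model is easiest to see in the classic word-count example. The sketch below simulates the map, shuffle (sort), and reduce steps of a Hadoop Streaming job in plain Python on a single machine; in a real job each phase runs distributed across the cluster's nodes:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: after the shuffle step sorts pairs by key,
    the counts for each word can be summed in a single pass."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["big data big tools", "data pipelines"]
counts = dict(reducer(mapper(lines)))
print(counts)  # {'big': 2, 'data': 2, 'pipelines': 1, 'tools': 1}
```

Because each mapper and each reducer only sees its own slice of data, the framework can restart a failed task on another machine, which is how the application-layer fault tolerance described above works in practice.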
15. Another open source tool used for Big Data is Elasticsearch, which can do blazing-fast searches on semi-structured or unstructured datasets.
cc: DocChewbacca - https://www.flickr.com/photos/49462908@N00
16. Elasticsearch is part of the Elastic Stack (also known as the ELK stack), which also contains Logstash (a data collection and log parsing tool) and Kibana (for analytics and visualization).
cc: PLeia2 - https://www.flickr.com/photos/64684255@N00
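Elasticsearch is queried over a JSON REST API using its Query DSL. The sketch below only builds a query body; the index name (`articles`) and field names (`body`, `year`) are hypothetical, and a real search would POST this body to `http://<host>:9200/articles/_search`:

```python
import json

# Hypothetical index ("articles") and fields ("body", "year"), for illustration.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"body": "big data"}}],       # full-text relevance match
            "filter": [{"range": {"year": {"gte": 2015}}}],  # exact structured filter
        }
    },
    "size": 10,  # return at most 10 hits
}
print(json.dumps(query, indent=2))
```

Combining a scored `match` clause with an unscored `filter` clause like this is the standard way to mix full-text relevance with structured constraints in one query.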
18. Data migration using ETL (Extract - Transform - Load) does not work well with Big Data, and hence the traditional ETL architecture is now giving way to real-time data streaming.
cc: SidPix - https://www.flickr.com/photos/22357152@N02
19. Apache Kafka is a high-throughput distributed messaging system that is being adopted by hundreds of companies to manage their real-time data.
cc: r2hox - https://www.flickr.com/photos/72764087@N00
20. Kafka is a perfect tool for building data pipelines: it is reliable, scalable, and efficient.
cc: ikarusmedia - https://www.flickr.com/photos/32650580@N06
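Part of what makes Kafka pipelines scale is that each message key is hashed to a fixed partition of a topic, so events for the same key stay in order while different partitions are consumed in parallel. A simplified illustration of that idea (Kafka's default partitioner actually uses a murmur2 hash; md5 stands in here to keep the sketch dependency-free):

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition, mimicking Kafka's key-hash
    partitioning (real Kafka uses murmur2; md5 is a stand-in here)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Events for the same device always land in the same partition,
# which preserves per-key ordering while allowing parallel consumers.
events = [(b"sensor-1", "42C"), (b"sensor-2", "17C"), (b"sensor-1", "43C")]
partitions = {}
for key, value in events:
    partitions.setdefault(partition_for(key, 4), []).append(value)
print(partitions)
```

Adding partitions (and consumers) is how a Kafka pipeline scales its throughput without giving up ordering guarantees per key.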
21. R - the language and environment for statistical computing
22. R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
cc: Crystal Writer - https://www.flickr.com/photos/17483452@N00
23. With over 2 million users worldwide, R is rapidly becoming the leading programming language in statistics and data science.
cc: Marc_Smith - https://www.flickr.com/photos/49503165485@N01
24. It is a great tool for data analysis and can be efficiently used on very large data sets.
cc: Régis Gaidot - https://www.flickr.com/photos/22019171@N00
25. Big Data is the next frontier for innovation, competition and productivity - in all fields, from healthcare to retail, from manufacturing to personal and location data.
cc: danielfoster437 - https://www.flickr.com/photos/17423713@N03
26. In most industries, established competitors and new entrants will leverage data-driven strategies to innovate, compete, and capture value from deep, real-time information.
cc: verbeeldingskr8 - https://www.flickr.com/photos/35429044@N04
27. We at appworx.cc offer data services that can help retail and other clients achieve their big data goals quickly.
https://www.appworx.cc/data
cc: Jason Michael - https://www.flickr.com/photos/70194213@N00