1. The document discusses challenges in building real-time, data-driven applications including dealing with big data, privacy concerns, performing some real-time analysis, and enabling real-time retrieval of large datasets.
2. It describes using Hadoop to store, enrich, and preprocess raw logs totaling around 40TB of data, while addressing privacy needs.
3. The author details techniques used to enable fast real-time retrieval of data points within a given date range and radius from a center location, such as indexing data and using temporary tables.
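As a rough illustration of the retrieval described in point 3, the query boils down to a date-range filter combined with a great-circle distance check. The sketch below is a plain linear scan under assumed names (`Observation`, `query` and the field layout are hypothetical, not taken from the talk); the indexing and temporary tables mentioned above would replace the scan in a real system.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Observation:
    # Hypothetical record layout; the talk does not specify one.
    day: date
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def query(observations, start, end, center_lat, center_lon, radius_km):
    """Return observations dated within [start, end] and within radius_km of the center."""
    return [
        o for o in observations
        if start <= o.day <= end
        and haversine_km(center_lat, center_lon, o.lat, o.lon) <= radius_km
    ]
```

A call such as `query(data, date(2014, 9, 1), date(2014, 9, 30), 52.37, 4.90, 50)` would keep only the September points within 50 km of the given center.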
GeoMesa on Apache Spark SQL with Anthony Fox - Databricks
GeoMesa is an open-source toolkit for processing and analyzing spatio-temporal data, such as IoT and sensor-produced observations, at scale. It provides a consistent API for querying and analyzing data on top of distributed databases (e.g. HBase, Accumulo, Bigtable, Cassandra) and messaging networks (e.g. Kafka) to handle batch analysis of historical archives of data and low-latency processing of data in-stream.
GeoMesa has deep integration with Spark SQL. It has added spatial types (e.g. Point, LineString, Polygons), spatial predicates (st_contains, st_intersects, etc.), and geometry processing functions (e.g. st_buffer, st_convexHull, etc.) to Spark SQL. It also optimizes the processing of these extensions by integrating with the Catalyst SQL optimizer to intercept SQL statements with spatial predicates and provision RDDs based on the underlying spatial index.
This session will describe the implementation of the GeoMesa Spark SQL integration, illustrate its application in production systems and demonstrate spatial aggregations and analytics using map-based visualizations.
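To make the idea of a spatial predicate concrete, here is a minimal pure-Python sketch of what a containment test such as st_contains decides for a point and a polygon. It uses the standard ray-casting rule and is only an illustration of the semantics; it is not GeoMesa's implementation, and the function names are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def select_contained(points, polygon):
    """Keep only the points that fall inside the polygon."""
    return [p for p in points if point_in_polygon(p, polygon)]
```

An optimizer that knows about the underlying spatial index can avoid running this test on every row by first discarding data that cannot intersect the query polygon, which is the kind of pruning the abstract attributes to the Catalyst integration.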
MongoDB Solution for Internet of Things and Big Data - Stefano Dindo
The Internet of Things is one of the most important market scenarios to invest in by 2020.
The Internet of Things brings people's real lives onto the Web through interaction with physical objects and spaces, exchanging a large volume of data.
During the Lab, an architecture for supporting Internet of Things projects was described, with a focus on organizing data within MongoDB, the market-leading NoSQL database, to collect and analyze large volumes of data efficiently and in real time.
The first task of the project is to generate a set of more than one million data points to be used as input for the k-means clustering algorithm. Next, the k-means algorithm is implemented following the MapReduce framework.
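One way to picture k-means in MapReduce terms is sketched below: the map step assigns each point to its nearest centroid, and the reduce step averages the points per centroid. This is a single-process sketch with hypothetical function names, not the project's actual Hadoop code; in a real job the map and reduce phases would run distributed.

```python
from collections import defaultdict


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (centroid index, point) for every input point."""
    for point in points:
        yield closest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    new_centroids = list(centroids)  # a centroid with no points keeps its position
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Repeat map + reduce until the centroids stop moving or the budget runs out."""
    for _ in range(iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Each iteration corresponds to one MapReduce job, with the current centroids passed to every mapper as side data.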
Distance oracle - Fast queries for the distance between any two points on a graph - Hong Ong
A review of how to quickly compute the distance between any two points on a graph, with applications in many fields such as telecommunications, internet routing, and social network analysis.
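As a small illustration of the precompute-then-query idea behind distance oracles, the sketch below stores breadth-first-search distances from a few landmark nodes and answers queries with an upper bound obtained by routing through a landmark. This landmark scheme is an assumption chosen for brevity and is not necessarily the construction the review covers; the class and function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


class LandmarkOracle:
    def __init__(self, graph, landmarks):
        # Preprocessing: one BFS table per landmark.
        self.tables = [bfs_distances(graph, lm) for lm in landmarks]

    def estimate(self, u, v):
        """Upper bound on d(u, v): the shortest detour through any landmark."""
        candidates = [t[u] + t[v] for t in self.tables if u in t and v in t]
        return min(candidates) if candidates else None
```

Queries cost time proportional to the number of landmarks rather than the size of the graph; the price is that the answer is exact only when some landmark lies on a shortest path between the two nodes.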
Strata 2014 Talk: Tracking a Soccer Game with Big Data - Srinath Perera
Mobile devices, sensors and GPS are driving the demand to handle big data in both batch and real time. This presentation discusses how we used complex event processing (CEP) and MapReduce based technologies to track and process data from a soccer match as part of the annual DEBS event processing challenge. In 2013, the challenge included a data set generated by a real soccer match in which sensors were placed in the soccer ball and players’ shoes. This session will review how we used CEP to implement the DEBS challenge and achieved throughput in excess of 100,000 events/sec. It will also examine how we extended the solution to conduct batch processing with business activity monitoring (BAM) on the same framework, enabling users to obtain both instant analytics and more detailed results from batch processing.
React with D3: DOM Manipulation Orchestration - Elden Park
Ever considered putting D3 objects into an existing React App? While creating a data visualizer, I have come across some of the constraints and solutions. Most importantly, how could we propagate state between the two?
Big data streams, Internet of Things, and Complex Event Processing Improve So... - Chris Haddad
Teams gain a competitive edge by analyzing Big Data streams. In this session, Chris will describe how complex event processing (CEP) and MapReduce based technologies can improve soccer team performance. Soccer match activity data captured by embedded sensors were streamed and analyzed to understand how player actions impact soccer play.
This presentation will dive into a development team’s use case for choosing MongoDB as their spatially enabled NoSQL solution. The talk will also cover how the integration of GeoServer can expand the accessibility of your data. GeoServer is the open source implementation of Open Geospatial Consortium (OGC) standards and a core component of the Geospatial Web.
How can you use PostgreSQL as a schemaless (NoSQL) database? Here we cover our use case and highlight upcoming features in PostgreSQL 9.4 and its integration with Django 1.7.
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by... - Spark Summit
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
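For readers unfamiliar with bitmap indexes, the following sketch shows the basic structure: one bitmap per distinct column value, with queries answered by bitwise AND/OR. It deliberately omits compression (the talk's actual focus) and uses Python integers as uncompressed bitmaps; the class and function names are hypothetical.

```python
class BitmapIndex:
    """One bitmap per distinct value; bit i is set when row i holds that value."""

    def __init__(self, values):
        self.bitmaps = {}
        for row, value in enumerate(values):
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)

    def lookup(self, value):
        """Bitmap of rows equal to value (0 if the value never occurs)."""
        return self.bitmaps.get(value, 0)


def rows(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two columns of the same five-row table.
country = BitmapIndex(["NL", "IE", "NL", "DE", "IE"])
device = BitmapIndex(["mobile", "mobile", "desktop", "mobile", "desktop"])

# WHERE country = 'NL' AND device = 'mobile'
nl_mobile = rows(country.lookup("NL") & device.lookup("mobile"))

# WHERE country = 'IE' OR country = 'DE'
ie_or_de = rows(country.lookup("IE") | country.lookup("DE"))
```

Compressed formats keep the same AND/OR interface while storing long runs of identical bits compactly, which is where most of the engineering effort described above goes.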
MongoDB natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. In this webinar, learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
Approximate "Now" is Better Than Accurate "Later" - NUS-ISS
How does Twitter track the top trending topics?
How does Amazon keep track of the top-selling items for the day?
How many cabs have been booked this month using your App?
Is the password that a new user is choosing a common/compromised password?
Modern web-scale systems process billions of transactions and generate terabytes of data every single day. In order to find answers to questions against this data, one would initiate a multi-minute query against a NoSQL datastore or kick off a batch job written in a distributed processing framework such as Spark or Flink. However, these jobs are throughput-heavy and not suited for real-time, low-latency queries. Yet you and your customers would like to have all this information "right now".
By the end of this talk, you'll realize that you can power these low-latency queries with an incredibly low memory footprint "IF" you are willing to accept answers that are, say, 96-99% accurate. This talk introduces some of the go-to probabilistic data structures used by organisations with large amounts of data, specifically the Bloom filter, Count-Min Sketch and HyperLogLog.
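The compromised-password question above maps naturally onto a Bloom filter, so here is a minimal sketch of one. The sizes, the salted SHA-256 hashing scheme and the class name are assumptions made for illustration, not details from the talk.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


common_passwords = BloomFilter()
for pw in ["123456", "password", "qwerty"]:
    common_passwords.add(pw)
```

A check such as `"password" in common_passwords` is always true for items that were added, while an unseen password is reported as present only with a small probability that depends on `num_bits`, `num_hashes` and the number of stored items.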
The Weather of the Century Part 3: Visualization - MongoDB
MongoDB natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. In this presentation, learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
The weather is everywhere and always. That makes for a lot of data. This talk will walk you through how you can use MongoDB to store and analyze worldwide weather data from the entire 20th century in a graphical application. We’ll discuss loading and indexing terabytes of data in a sharded cluster, and optimizing the schema design for interactive exploration. MongoDB also natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. You'll learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
A simple way to develop in Java with performance in mind. In this presentation we will look at some of the basics that we tend to miss when developing, which eventually lead to memory inefficiencies in our applications.
Real time data driven applications (SQL vs NoSQL databases) - GoDataDriven
Content and talk by Giovanni Lanzani (GoDataDriven) at NoSQL matters in Dublin (September 2014).
Big Data: Everybody talks about it, nobody knows how to do it. Everyone else thinks someone else is doing it, so claims to be doing it.
Giovanni covers what real-time data-driven applications are, presents one of the apps built for a GoDataDriven customer, and explains what challenges arose and which database helped GoDataDriven achieve the level of performance they wanted.
Content and talk by Giovanni Lanzani (GoDataDriven) at SEA Amsterdam in November 2014: real-time data-driven applications using Python and pandas as the backend.
In an R&D company fast prototyping is vital to develop new projects or proofs of concept quickly and inexpensively. In this talk we will demonstrate how real fast and agile development can be achieved with MongoDB and dynamic languages, with examples and best practices. All the code shown is already uploaded to a public Git repository - https://github.com/pablito56/py-eshop
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017 - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
And we’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Digital analytics with R - Sydney Users of R Forum - May 2015 - Johann de Boer
A presentation given to the Sydney Users of R forum about an open source R package I developed for querying Google Analytics data.
For instructions on getting started with ganalytics please refer to the readme file here: https://github.com/jdeboer/ganalytics/blob/master/README.md
It would be great to hear any feedback or questions you have about the ganalytics package or the presentation. If you encounter any difficulties installing or using the ganalytics package, please let me know so that it can be made easier for everyone. Submit issues here: https://github.com/jdeboer/ganalytics/issues/
Contributions to the package are welcome:
- Package documentation
- Adding examples and demos
- Testing and finding bugs to fix
- Ideas for improvements or new features
Thanks for your interest. Feel free to reach out to me via twitter: @johannux
Beyond php - it's not (just) about the code - Wim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Monitoring Big Data Systems - "The Simple Way" - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
And we’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Demi Ben-Ari is a Co-Founder and CTO @ Panorays.
Demi has over 9 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
He describes himself as a software development groupie, interested in tackling cutting-edge technologies.
Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
Beyond php it's not (just) about the code - Wim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D... - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
And we’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system using tools like Web Services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows in the system? We’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Business Dashboards using Bonobo ETL, Grafana and Apache Airflow - Romain Dorgueil
Zero-to-one hands-on introduction to building a business dashboard using Bonobo ETL, Apache Airflow, and a bit of Grafana (because graphs are cool). The talk is based on the early version of our tools to visualize the apercite.fr website. Plan, implement, visualize, monitor, and iterate from there.
A talk about data workflow tools at Metrics Monday Helsinki.
Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either company.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk, we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system using tools like Web Services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows in the system? We’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion... - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
And we’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Similar to Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014 (20)
Nathan Ford - Divination of the Defects (Graph-Based Defect Prediction through... - NoSQLmatters
While metrics generated by static code analysis are well established as predictors of possible future defects, there is another untapped source of useful information, namely your source code revision history. This presentation will discuss converting this revision information into a graph representation, various defect prediction models and how to generate their related change metrics through graph traversal, as well as the potential applications and benefits of these graph enabled prediction models.
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt... - NoSQLmatters
PostgreSQL is well known as an object-relational database management system. At its core, PostgreSQL is schema-aware, dealing with fixed database tables and column types. However, recent versions of PostgreSQL made it possible to deal with schema-free data. Learn which new features PostgreSQL supports and how to use those features in your application.
NoSQL matters, on that much I'm sure we can all agree. But if we take a closer look, what really matters when it comes to choosing a data store and/or a data processing platform? What really matters when it comes to getting the most out of that platform? And what is really going to matter as we take things to the next level?
Peter Bakas - Zero to Insights - Real time analytics with Kafka, C*, and Spar... - NoSQLmatters
In this talk, Peter will cover his experience using Spark, Cassandra & Kafka to build a real-time analytics platform that processed billions of events a day. He will cover the challenges of turning all those raw events into actionable insights. He will also cover scaling the platform across multiple regions, as well as across multiple cloud environments.
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du... - NoSQLmatters
Data analysis is an exploratory process that requires a variety of tools and a flexible data store. Data analysis projects are easy to start but quickly become difficult to manage and error-prone when depending on file-based data storage. Relational databases are poorly equipped to accommodate the dynamic demands of complex analysis. This talk describes best practices for using MongoDB for analytics projects. Examples will be drawn from a large-scale text mining project (approximately 25 million documents) that applies machine learning (neural networks and support vector machines) and statistical analysis. Tools discussed include R, Spark, the Python scientific stack, and custom pre-processing scripts, but the focus is on using these with the document database.
Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015 - NoSQLmatters
Sometimes we need to step back and take a look at the bigger picture - not just counting huge piles of individual log records, but reasoning about the behaviors of the people who are ultimately generating this firehose of data. While your DevOps folks care deeply about log records from a machine utilization perspective, marketing wants to know what these records tell us about the customers' needs. Elasticsearch Aggregations are a great feature but are not a panacea. We can happily use them to summarise complex things like the number of web requests per day broken down by geography and browser type on a busy website, but we would quickly run out of memory if we tried to calculate something as simple as a single number for the average duration of visitor web sessions when using the very same dataset. Why does this occur? A web session duration is an example of a behavioural attribute not held on any one log record; it has to be derived by finding the first and last records for each session in our weblogs, requiring some complex query expressions and a lot of memory to connect all the data points. We can maintain a more useful joined-up picture if we run an ongoing background process to fuse related events from one index into "entity-centric" summaries in another index, e.g.:
- Web log events summarised into "web session" entities
- Road-worthiness test results summarised into "car" entities
- Reviews in a marketplace summarised into a "reviewer" entity
Using real data, this session will demonstrate how to incrementally build entity-centric indexes alongside event-centric indexes by using simple scripts to uncover interesting behaviours that accumulate over time. We'll explore:
- Which cars are driven long distances after failing roadworthiness tests?
- Which website visitors look to be behaving like "bots"?
- Which seller in my marketplace has employed an army of "shills" to boost his feedback rating?
Attendees will leave this session with all the tools required to begin building entity-centric indexes and using that data to derive richer business insights across every department in their organization.
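The session-duration example can be sketched in a few lines: each batch of log events is folded into a per-session summary, so the average duration is computed from the small entity store instead of the raw event stream. This is a minimal in-memory sketch with hypothetical field and function names; a plain dict stands in for the second index, and no Elasticsearch API is used.

```python
def fold_events(sessions, events):
    """Fold a batch of log events into per-session summaries (updates sessions in place)."""
    for event in events:
        sid = event["session_id"]
        ts = event["timestamp"]
        summary = sessions.get(sid)
        if summary is None:
            sessions[sid] = {"first_seen": ts, "last_seen": ts, "hits": 1}
        else:
            summary["first_seen"] = min(summary["first_seen"], ts)
            summary["last_seen"] = max(summary["last_seen"], ts)
            summary["hits"] += 1
    return sessions


def average_duration(sessions):
    """Average of (last_seen - first_seen) over all session entities."""
    durations = [s["last_seen"] - s["first_seen"] for s in sessions.values()]
    return sum(durations) / len(durations) if durations else 0.0


# Events arrive in batches; the entity store is updated incrementally.
store = {}
fold_events(store, [
    {"session_id": "a", "timestamp": 0},
    {"session_id": "b", "timestamp": 10},
    {"session_id": "a", "timestamp": 30},
])
fold_events(store, [
    {"session_id": "b", "timestamp": 70},
    {"session_id": "a", "timestamp": 60},
])
```

Because each summary only keeps the first and last timestamp plus a counter, memory grows with the number of sessions rather than the number of log records.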
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase -... - NoSQLmatters
Relevance and personalization are crucial to building a personalized local commerce experience at Groupon. This talk gives an overview of the real-time analytics infrastructure that handles over 3 million events/second and stores and scales to billions of data points. It covers how our Kafka -> Storm -> Redis/HBase pipeline is used to generate real-time analytics for hundreds of millions of Groupon users, along with various architecture design choices and tradeoffs, including some interesting algorithmic choices such as Bloom filters and HyperLogLog. Attendees can take away learnings from our real-life experience that can help them understand various tuning methods and their tradeoffs, and apply them in their own solutions.
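Since HyperLogLog is named as one of the algorithmic choices, a compact sketch of the idea may help: hash each item, use the first bits to pick a register, and remember the longest run of leading zeros seen per register. The precision, the SHA-1 based hashing and the class name are assumptions for illustration and say nothing about how the pipeline described above implements it.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counting in a fixed number of small registers."""

    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash value
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


visitors = HyperLogLog()
for user_id in range(10000):
    visitors.add(f"user-{user_id}")
```

With 1,024 registers the typical relative error is a few percent, and adding the same item again never changes the estimate, which is what makes the structure attractive for counting unique users in a stream.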
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top... - NoSQLmatters
Building applications on streaming data has its challenges. If you are trying to use programs such as Apache Spark or Storm to build applications, this presentation will explain the advantages and disadvantages of each solution and how to choose the right tool for your next streaming data project. Building streaming data applications that can manage the massive quantities of data generated from mobile devices, M2M, sensors and other IoT devices is a big challenge that many organizations face today. Traditional tools, such as conventional database systems, do not have the capacity to ingest data, analyze it in real time, and make decisions. New technologies such as Apache Spark and Storm are now coming to the forefront as possible solutions for handling fast data streams. Typical technology choices fall into one of three categories: OLAP, OLTP, and stream-processing systems. Each of these solutions has its benefits, but some choices support streaming data and application development much better than others. Employing a solution that handles streaming data, provides state, ensures durability, and supports transactions and real-time decisions is key to benefitting from fast data. During this presentation you will learn:
- The difference between fast OLAP, stream-processing, and OLTP database solutions.
- The importance of state, real-time analytics and real-time decisions when building applications on streaming data.
- How streaming applications deliver more value when built on a super-fast in-memory SQL database.
Just a few years ago all software systems were designed to be monoliths running on a single big and powerful machine. But nowadays most companies want to scale out instead of scaling up, because it is much easier to buy or rent a large cluster of commodity hardware than to get a single machine that is powerful enough. In the database area scaling out is realized by utilizing a combination of polyglot persistence and sharding of data. On the application level scaling out is realized by microservices. In this talk I will briefly introduce the concepts and ideas of microservices and discuss their benefits and drawbacks. Afterwards I will focus on the point of intersection of a microservice-based application talking to one or many NoSQL databases. We will try and find answers to these questions: Are there differences to a monolithic application? How to scale the whole system properly? What about polyglot persistence? Is there a data-centric way to split microservices?
Chris Ward - Understanding databases for distributed docker applications - No...NoSQLmatters
In this talk we'll focus on the use of Crate alongside Weave in Docker containers, the technical challenges, best practices learned, and getting a big data application running alongside it. You'll learn about the reasons why Crate.IO is building "yet another NoSQL database" and why it's unique and important when running web-scale containerized applications. We'll show why the shared-nothing architecture is so important when deploying large clusters in containers and how it addresses the issues and fears of a Docker-based persistence layer. You will learn how to deploy a Crate cluster in the cloud within minutes using Docker, some of the challenges you'll encounter, and how to overcome them in order to scale your backends efficiently. We focused on super simple integration with any cloud provider, striving to make it as turnkey as possible with minimal up-front configuration required to establish a cluster. Once established, we'll show how to scale the cluster horizontally by simply adding more nodes. The session will also give you examples of when you should use Crate compared to other similar technologies such as MongoDB, Hadoop, Cassandra or FoundationDB. We'll talk about this approach's strengths and what types of applications are well-suited for this type of data store, as well as which are not. Finally we'll outline how to architect an application that is easy to scale using Crate and Docker.
Philipp Krenn - Host your database in the cloud, they said... - NoSQL matters...NoSQLmatters
More than two years ago we faced the decision whether to run our MongoDB database on Amazon's EC2 ourselves or to rely on a Database as a Service provider. Common wisdom told us that a well known provider, focusing all its knowledge and energy on running MongoDB, would be a better choice than us trying it on the side. Well, this talk describes what can go wrong, since we have seen a lot of interesting minor and major hiccups — including stopped instances, broken backups, a major security incident, and more broken backups. Additionally, we discuss some reasons why a hosted solution is not always the better choice and which new challenges arise from it.
Lucian Precup - Back to the Future: SQL 92 for Elasticsearch? - NoSQL matters...NoSQLmatters
What if we would try to make Elasticsearch SQL 92 compliant (http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt)? This wouldn't serve that much nowadays, you would say. Well, we actually tried to do the exercise and we have some interesting conclusions. While we take Elasticsearch as an example for this "side by side", the issues we are addressing also apply to nosql in general. With this unusual exercise, we take the occasion to compare relational databases / sql with Elasticsearch / nosql on all the levels : functionality, semantics, performance and user experience.
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015NoSQLmatters
There are many frameworks that can offer real time on top of Hadoop. This talk will show you the usage of Pivotal HAWQ and how easy it is to use SQL for querying your Hadoop data. Come and see the power and ease of use that can help you use the Hadoop ecosystem.
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...NoSQLmatters
Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. By combining Spark's flexible API and Cassandra's performance, we get an interesting alternative to the Hadoop ecosystem for both real-time and batch processing. During this talk we will highlight the tight integration between Spark & Cassandra and demonstrate some usages with a live code demo.
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
When deploying your service to Microsoft Azure, you have a number of options in terms of noSQL: you can install databases on Linux or Windows virtual machines by yourself, or via the marketplace, or you can use open source databases available as a service like HBase or proprietary and managed databases like Document DB. After showing these options, we'll show Document DB in more details. This is a noSQL database as a service that stores JSON.
David Pilato - Advance search for your legacy application - NoSQL matters Par...NoSQLmatters
How do you mix SQL and NoSQL worlds without starting a messy revolution? This live coding talk will show you how to add Elasticsearch to your legacy application without changing all your current development habits. Your application will suddenly have advanced search features, all without the need to write complex SQL code! David will start from a Spring, Hibernate and PostgreSQL based application and will add a complete integration of Elasticsearch, all live from the stage during his presentation.
Tugdual Grall - From SQL to NoSQL in less than 40 min - NoSQL matters Paris 2015NoSQLmatters
During this live-coding session, Tugdual will move an old-fashioned full-SQL application (JavaEE) to the new NoSQL world. Using MongoDB and REST, he will show the benefits of this new architecture: * Ease of use * Flexibility * High availability * Scalability. During this presentation, you will learn more about: * Document Oriented Model * JSON * REST * Iterative development. This demonstration is also a good opportunity to see how you can migrate data from a relational database, and the various schema options.
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015NoSQLmatters
How do you monitor performance for one of your clients on a specific user segmentation when dealing with billions of events a day? With over 2 billion ads served and 230TB of data processed a day, we at Criteo have a comprehensive need for an interactive analytics stack. And by interactive, we mean a querying system with dynamic filtering to drill down over multiple dimensions, answering within sub-second latency. This session will take you on our journey with Druid, "an open-source data store designed for real-time exploratory analytics on large data sets". We will explore Druid's architecture and notable concepts, how relevant they are for some use cases and how it really performs.
In many modern applications the database side is realized using polyglot persistence: store each data format (graphs, documents, etc.) in an appropriate separate database. This approach yields several benefits, since databases are optimized for their specific duty; however, there are also drawbacks: * keep all databases in sync * queries might require data from several databases * experts needed for all used systems. A multi-model database is not restricted to one data format, but can cope with several of them. In this talk I will present how a multi-model database can be used in a polyglot persistence setup and how it will reduce the effort drastically.
Rob Harrop- Key Note The God, the Bad and the Ugly - NoSQL matters Paris 2015NoSQLmatters
The impact that NoSQL has had on the technology community cannot be overstated. The proliferation of new and exciting data systems has led to a slew of interesting solutions to problems that were once solved the relational way. In this session we explore all that is great and good about NoSQL: the innovative software, the clever storage paradigms and the reigniting of developer interest in data access. It is unfortunate that NoSQL is not only a force for good in our community. We'll explore some of the darker corners of NoSQL: the disregard for years of proven technology, the overbearing hype, the overblown marketing and the ever present arguments over which technology is best. We close the session by exploring what can be done to extract even more value from the NoSQL movement, where we can improve how the community interacts with the larger technology community and what the future holds for data access technologies.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
4. Real-time, data driven app?
• No store and retrieve;
• Store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieve is not a batch process;
• App: something your mother could use:
SELECT attendees
FROM NoSQLMatters
WHERE password = '1234';
9. Challenges
1. Big Data;
2. Privacy;
3. Some real-time analysis;
4. Real-time retrieval.
10. Is it Big Data?
Everybody talks about it
Nobody knows how to do it
Everyone thinks everyone else is doing it, so everyone claims they’re doing it…
Dan Ariely
11. Is it Big Data?
• Raw logs are in the order of 40TB;
• We use Hadoop for storing, enriching and pre-processing.
19. Data Example
date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2
20. helper.py example
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]
    # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()  # select sum(hits) from …
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
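As a quick check, get_statistics can be run on the four sample rows from the Data Example slide. This is a minimal sketch, assuming pandas is installed and that the column names match that slide; the function is restated in condensed form only so that the block runs on its own.

```python
import pandas as pd

# The four sample rows from the "Data Example" slide.
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})


def get_statistics(data, sbi):
    # Keep only the rows for the requested sbi code.
    sbi_df = data[data.sbi == sbi]
    hits = sbi_df.hits.sum()
    delta_hits = sbi_df.delta.sum()
    # Guard against division by zero when there is no delta.
    percentage = (hits - delta_hits) / delta_hits if delta_hits else 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}


stats = get_statistics(data, sbi=1)
print(stats["total"])  # 35 + 45 = 80
```

For sbi 1 the total is 35 + 45 = 80 hits against a delta of 22 + 35 = 57, so the percentage is (80 - 57) / 57.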
21. helper.py example
def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    return df_sbi
22. Who has my data?
• First iteration was a (pre)-POC, less data (3GB vs 500GB);
• Time constraints;
• Oops: everything is a pandas df!
23. Advantage of “everything is a df”
Pro:
• Fast!!
• Use what you know
• NO DBA’s!
• We all love CSV’s!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBA’s!
• We all hate CSV’s!
24. If you want to go down this path
• Set the dataframe index wisely;
• Align the data to the index:
source_data.sort_index(inplace=True)
• Beware of modifications of the original dataframe!
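The three points above can be sketched as follows. This is only an illustration, assuming pandas is installed and that the app filters by date and postcode (the slides do not say which index was actually used); the rows are a subset of the sample data.

```python
import pandas as pd

# Unsorted sample rows (subset of the columns from the data example).
source_data = pd.DataFrame({
    "date": ["2013-01-08", "2013-01-01", "2013-01-08", "2013-01-01"],
    "postcode": ["1234AB", "5555ZB", "5555ZB", "1234AB"],
    "hits": [45, 2, 55, 35],
})

# Set the index wisely: here, the columns used for filtering (an assumption).
source_data = source_data.set_index(["date", "postcode"])

# Align the data to the index so label lookups work on sorted data.
source_data.sort_index(inplace=True)

# Select one date, then copy before changing anything, so the shared
# source dataframe is never modified by a request.
selection = source_data.loc["2013-01-01"].copy()
selection["hits"] = selection["hits"] * 2

print(source_data.index.is_monotonic_increasing)  # True after sort_index
```

Taking an explicit copy costs memory, but it keeps the long-lived dataframe safe when several requests read from it.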
25. If you want to go down this path
The reason pandas is faster is because I came up with a better algorithm
26. If you don’t
Front-end (AngularJS) <- REST / JSON -> Back-end (python app) <-> Database (?)
31. Issues?!
• With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
SELECT * FROM datapoints
WHERE
date IN date_array
AND
postcode IN postcode_array;
• Index on date and postcode, but single queries running more than 20 minutes.
32. Postgres + Postgis (2.x)
PostGIS is a spatial database extender for PostgreSQL.
Supports geographic objects allowing location queries:
SELECT *
FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km
-- from (lat, lon) on imaginary dates
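To make the "every point within 1.5km" idea concrete outside the database, the sketch below filters points by great-circle distance in plain Python. It uses the haversine formula, which is not necessarily how PostGIS computes distances, and the function names and coordinates are illustrative assumptions (the coordinates are only approximate locations in Amsterdam and Rotterdam).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_radius(points, center, radius_m):
    """Return the (lat, lon) points lying at most radius_m metres from center."""
    lat0, lon0 = center
    return [p for p in points if haversine_m(lat0, lon0, p[0], p[1]) <= radius_m]


# Illustrative, approximate coordinates (hypothetical test data).
center = (52.3731, 4.8932)       # central Amsterdam
points = [
    (52.3791, 4.9003),           # under a kilometre from the center
    (51.9244, 4.4777),           # Rotterdam, tens of kilometres away
]
nearby = within_radius(points, center, 1500)
print(len(nearby))  # 1
```

A linear scan like this is fine for a handful of points; the reason to use a spatial index in the database is that it avoids computing the distance to every row.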
34. How we solved it
1. Align data on disk by date;
2. Use the temporary table trick:
CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp
JOIN datapoints d
ON d.postcode = tmp.postcodes
WHERE
d.dt IN dates_array;
3. Lose precision: 1234AB→1234
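Steps 2 and 3 can be tried end to end with Python's built-in sqlite3 module. This is a sketch under stated assumptions: SQLite stands in for the Postgres database of the talk, the datapoints table is assumed to already store 4-character postcodes, and the rows (including the 9999 postcode) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [
        ("2013-01-01", "1234", 35),
        ("2013-01-08", "1234", 45),
        ("2013-01-01", "5555", 2),
        ("2013-01-08", "9999", 7),  # outside the requested postcodes
    ],
)

postcode_array = ["1234AB", "5555ZB"]
dates_array = ["2013-01-01", "2013-01-08"]

# Step 3: lose precision (1234AB -> 1234) and drop duplicates.
short_postcodes = sorted({pc[:4] for pc in postcode_array})

# Step 2: put the postcodes in a temporary table and join against it,
# instead of sending a huge IN (...) list.
conn.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
conn.executemany(
    "INSERT INTO tmp (postcodes) VALUES (?)",
    [(pc,) for pc in short_postcodes],
)

placeholders = ", ".join("?" for _ in dates_array)
rows = conn.execute(
    "SELECT d.dt, d.postcode, d.hits FROM tmp "
    "JOIN datapoints d ON d.postcode = tmp.postcodes "
    f"WHERE d.dt IN ({placeholders}) "
    "ORDER BY d.dt, d.postcode",
    dates_array,
).fetchall()

for row in rows:
    print(row)
```

Truncating the postcodes shrinks the temporary table considerably, at the cost of returning some points slightly outside the requested radius.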
35. Take home messages
1. Geospatial problems are “hard” and can kill your queries;
2. Not everybody has infinite resources: be smart and KISS!
3. SQL or NoSQL? (Size, schema)
36. GoDataDriven
We’re hiring / Questions? / Thank you!
@gglanzani
giovannilanzani@godatadriven.com
Giovanni Lanzani
Data Whisperer