Real time data driven applications (SQL vs NoSQL databases) - GoDataDriven
Content and talk by Giovanni Lanzani (GoDataDriven) at NoSQL Matters in Dublin (September 2014).
Big Data: Everybody talks about it, nobody knows how to do it. Everyone thinks everyone else is doing it, so everyone claims they're doing it.
Giovanni covers what real time data driven applications are, presents an app built for one of GoDataDriven's customers, the challenges that arose, and the database that helped GoDataDriven achieve the level of performance they wanted.
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge - Dataiku
This is a presentation given on 13 August 2014 at the SF Data Mining Meetup at Trulia. It's about Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex.
Open Source Big Graph Analytics on Neo4j with Apache Spark - Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You’ll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017 - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won't find in monolithic systems. All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we'll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We'll also cover the simplest solution, built from your day-to-day open source tools; the surprising thing is that it comes not from an ops guy.
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... - Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Many think that data science is like a Kaggle competition. There are, however, big differences in approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected production performance.
Building Better Models Faster Using Active Learning - CrowdFlower
Active learning is an increasingly popular technique for rapidly iterating on the construction of machine learning models, exploiting the fact that the current state of the model can be used to predict which additional examples will be the most informative. Active learning is appealing for two main reasons: it optimizes ongoing human involvement in the model building process, and it helps overcome the negative effects of imbalanced training data. In this talk, Nick explains how active learning helps overcome common obstacles to building successful models, and also offers a peek into CrowdFlower AI, CrowdFlower's new active-learning-based offering.
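As an illustration of the core idea (not CrowdFlower's implementation), a minimal uncertainty-sampling loop in Python might look like the sketch below, assuming scikit-learn and a synthetic labeled pool:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pool of examples with labels we pretend are hidden.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Balanced seed set so the first fit sees both classes.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression()
for _ in range(5):  # five simulated labeling rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    # The current model picks the pool examples it is least sure about.
    order = np.argsort(np.abs(proba - 0.5))
    picked = [unlabeled[i] for i in order[:20]]
    labeled += picked  # pretend a human just labeled these
    picked_set = set(picked)
    unlabeled = [i for i in unlabeled if i not in picked_set]

Each round, human effort is spent only on the examples nearest the decision boundary, which is what makes the iteration fast.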
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D... - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won't find in monolithic systems. All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we'll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We'll also cover the simplest solution, built from your day-to-day open source tools; the surprising thing is that it comes not from an ops guy.
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big... - Big Data Spain
Hadoop clusters can store nearly everything in your data lake cheaply and at blazing speed. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses.
https://www.bigdataspain.org/2017/talk/fishing-graphs-in-a-hadoop-data-lake
Big Data Spain 2017, 16th - 17th November, Kinépolis Madrid
Heating solution using Panstamp and Python - Oriol Rius
A short presentation about how I redesigned the heating system at home.
Presentation's video: http://youtu.be/NA2fmZYHfmw
Python Meetup Barcelona - 25/September/2014
Content and talk by Giovanni Lanzani (GoDataDriven) at SEA Amsterdam in November 2014, on real-time data driven applications using Python and pandas as the backend.
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat... - PyData
Oftentimes there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data-driven companies and applications, it is more pressing now than ever to be able to automate your analyses to personalize your users' experiences. LinkedIn's People You May Know, Netflix's and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.
As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general-purpose, performant programming language, rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), web frameworks/community, and utilities/libraries for handling data at scale. In this talk I will walk through a fictional company bringing its first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, some of the pitfalls of what happens when you put models into production, and how to make sure your users (and engineers) are as happy as they can be.
https://github.com/Jay-Oh-eN/pydatasv2014
Google Analytics vs. Omniture Comparative Guide - Jimmy Jay
The Google Analytics vs. Omniture Comparative Guide is a clear way to differentiate between two available web analytics applications. The guide covers both the basic and the complex features of the two platforms.
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL... - NoSQLmatters
Giovanni Lanzani – SQL & NoSQL databases for data driven applications
For data to be the fuel of the 21st century, and for data science to live up to its promise as a driver of innovation, their application should not be confined to dashboards and static analyses. Instead they should be the driver of real applications that support the organisations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for secure, real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources.
In this talk we will describe our journey to build a data driven application at a large Dutch financial institution. We will dive into the issues we faced, our considerations, and the technical choices we made in order to perform data analyses but also drive web-based, real-time applications. We considered and used Impala, HBase, and MongoDB, but also conventional SQL databases such as MySQL and PostgreSQL. Important aspects in our journey were, among others, the handling of geographical data, access to hundreds of millions of records, and the real-time analysis of millions of data points.
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays - Demi Ben-Ari
Everybody wants to join the "Big Data" hype cycle, "to do scale", and to use the coolest tools in the market, like Hadoop, Apache Spark, Apache Cassandra, etc.
But do they ask themselves whether there is really a reason for that?
In the talk we'll give a brief overview of all of the technologies in the Big Data world nowadays, and we'll talk about the problems that really emerge when you'd like to enter the great world of Big Data handling.
Showing you the Hadoop ecosystem, Apache Spark, and all of the distributed tools leading the market today will give you a notion of the real costs of entering that world.
I promise I'll share some stories from the trenches :)
(And about the "pool" thing... I don't really know how to swim.)
You did a great job finishing this web app on time and on budget: design patterns, good code coverage, cutting-edge frameworks, and the best CI ever. It goes to production and, boom, clients complain it's too slow. They don't really care if it's the best engineering ever when each view takes 4 seconds to load. My presentation will give you hints on how to look for bottlenecks. I will also share simple tricks to make the app work faster, or at least seem to work faster.
Big Data Analytics: Finding diamonds in the rough with Azure - Christos Charmatzis
This session presents the main workflows and technologies for getting value from the Big Data stored in your enterprise using Azure:
- When do we have a Big Data problem?
- Finding the best solution for our Big Data
- Working inside the Data Team
- Extracting the true value of our data
Big Data made easy in the era of the Cloud - Demi Ben-Ari
A talk about the ease of using and handling Big Data technologies in the cloud, with Google Cloud Platform, Amazon Web Services, and all of the tools around them.
Showing the problems and how we can solve them with simple tools.
So your boss says you need to learn data science - Susan Ibach
Interested in data science but confused by all the terms? Not sure where to start? This presentation breaks down the concepts and the terminology.
Pre-Aggregated Analytics And Social Feeds Using MongoDB - Rackspace
Jon Hyman, co-founder and CIO of Appboy, an engagement platform for mobile apps, highlights how Appboy solved issues around pre-aggregated analytics and used statistical formulas on top of the aggregation framework to return results in real time as its data grew. Greg Avola, co-founder and developer at Untappd, a social network for beer lovers, discusses how MongoDB and ObjectRocket helped Untappd address problems with serving its social feed, how it sustained high performance at 5,000 to 6,000 queries per second, and how it used location indexes to enable geo-location search.
This is my intro to MongoDB talk presented to the Miami MongoDB User Group in February 2015. It's a pretty high level talk, mainly geared for folks that have not used it before.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
From a Student to an Apache Committer: Practice of Apache IoTDB - jixuan1989
This talk was given by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community, and developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source projects to the world.
The invited guests of this lecture are all from the ASF community, including the chairman of the Apache Software Foundation, three Apache members, top-5 Apache code committers (according to the Apache annual report), the first committer on the Hadoop project in China, several Apache project mentors and VPs, and many Apache committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
Lambda architecture for real time big data - Trieu Nguyen
Lambda Architecture in a Real-time Big Data Project
Concepts & Techniques: "Thinking with Lambda"
Case studies from real projects
Why is lambda architecture the correct solution for big data?
How to create a Devcontainer for your Python project - GoDataDriven
Prevent misaligned environments between developers, onboard new joiners faster, and reduce the time it takes to take your project to production. Sounds interesting? Devcontainers can help you with this. Devcontainers allow you to connect your IDE to a running Docker container and develop inside it. This gives you all the benefits of reproducibility that Docker is known for. In this talk, I will walk you through what Devcontainers are, why they might be useful for you, and how to create one for your Python project using VSCode.
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z... - GoDataDriven
Many machine learning models we use today have the core assumption that our data needs to be tabular, but how often is this truly the case? What if our data points are not independent? By ignoring the potential interrelatedness of our data, do we lose meaningful information that our models cannot leverage? In this talk, we shall explore graph neural networks and highlight how they can solve interesting problems in a way that is intractable when limiting ourselves to using tabular data. We will look at the limitations of common algorithms and highlight how some clever linear algebra enables us to incorporate more meaningful information into our models. Social network data is a popular example of where relationships are relevant but relationships exist in many types of data where it may not be so obvious. Whether it's e-commerce, logistics or molecular data, relationships within your data likely exist and making use of them can be incredibly powerful. This talk will hopefully spark your curiosity and provide you with a way of looking at problems from a new angle. It is intended for anyone with an interest in machine learning and will only lightly touch on some technical details.
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022 - GoDataDriven
Time-series data is all around us: from logistics to digital marketing, from pricing to stock markets - it’s hard to imagine a modern business that has no time series data to forecast. However, mastering such forecasting is not an easy task. For this talk, we have collected a list of common time series issues that digital fortune tellers commonly run into. You will learn how to identify, understand and resolve them better. This will include stabilising divergent time series, handling outliers without anomaly propagation, reducing the impact of noise and more.
MLOps CodeBreakfast on AWS - GoDataFest 2022 - GoDataDriven
During the MLOps CodeBreakfast, we will be giving an introduction to MLOps. After this introduction, we will go into more detail on how to implement and deploy a Machine Learning pipeline on both Azure and AWS.
MLOps CodeBreakfast on Azure - GoDataFest 2022 - GoDataDriven
During the MLOps CodeBreakfast, we will be giving an introduction to MLOps. After this introduction, we will go into more detail on how to implement and deploy a Machine Learning pipeline on both Azure and AWS.
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022 - GoDataDriven
In this talk, we will compare the most widely used BI tools in the market from the perspective of a mature data organization. The focus of this talk won't be on flashy features or superficial sales talk. We will compare both tools in terms of how well they fit in with DataOps best practices: how do they rank in terms of speed of delivery, governance, robustness, and analytical capabilities?
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022 - GoDataDriven
Deploy your own modern data stack using open source components and cloud-agnostic tooling like Terraform. By leveraging open-source components you can deploy a state-of-the-art modern data platform in a day. What are the pros and cons of "build-it-yourself" in the data + analytics space?
AWS Well-Architected Webinar Security - Ben de Haan - GoDataDriven
The security pillar encompasses the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. This presentation will provide in-depth, best-practice guidance for architecting secure systems on AWS.
The 7 Habits of Effective Data Driven Companies - GoDataDriven
1. Start by searching for use cases with value & impact: without use cases, nobody will want to draft a data strategy.
Where do you want to go? Draft a clear customer experience that you want to create, and think about the organization & data strategy to get there!
2. Get Tech (data scientists and engineers) and Business (Product Management & Commercial) at the same table: create a solid foundation.
3. Start with communities of practice to learn & experiment together and build the capability.
4. Stop talking about data. Start experimenting and doing.
5. Product Management needs to get real about data. (start training these capabilities)
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema... - GoDataDriven
The typical organizational model is that teams are in constant flux, are created for work, are only responsible for the change, and are not empowered, or lack trust, to run products. A high-performance organization model allows teams to take full responsibility for cost, compliance and security, and lets them own their own incidents. This improves quality and change failure rates, lowers costs, and leads to happier employees. DevOps is about creating with the end in mind, cross-functional autonomous teams, and end-to-end responsibility. You build it, you run it. You break it, you fix it. This means you want to automate everything in a CI/CD pipeline. Roll forward, don't roll back. DevOps principles play an important role in a data-driven maturity model: continuous prototyping, and a data mindset and skills for everybody.
In a data science workflow, combining input data and deriving the model features usually requires most of the work, and lots of iterations before it's done. Implement features one by one. Start with a baseline model and compare it against more complex models, to see if the additional complexity is worth the performance gain. The result of a data scientist's work is a trained model. Such a model contains four components: input data, derived features, chosen model type, and hyperparameters. A trained model is always the combination of data and code. So where do you run this trained model? Model management is versioning code but not the data. A model management server stores hyperparameters, performance metrics, metadata, and trained models.
In a data science pipeline, we have two components for deployment: the application and the trained model. So we split the pipeline into parts: a build pipeline, a train pipeline, and a deploy pipeline. A complete pipeline mapped to Azure components would look largely like this: an Azure DevOps build pipeline, an Azure ML training pipeline, and an Azure DevOps release pipeline.
Artificial intelligence in action: delivering a new experience to Formula 1 ... - GoDataDriven
At GoDataFest 2019, Guy Kfir presented how AI delivers a new experience to Formula 1 fans across the world. AWS fuels the analytics through machine learning. Did you know a Formula 1 race car contains 120 sensors and generates 3 GB of data every race, at 1,500 data points per second? AWS developed several applications, including overtake possibility and pitstop advantage. How important is it for your company to invest in machine learning and AI? There are three scenarios for AI/ML success: automation, enrichment, and invention. So, what are you waiting for: create the loop, advance your data strategy, and organize for success. To get started, identify AI/ML use cases, educate yourself, start with AI services and move to Amazon SageMaker, engage with AWS, and consider the partner ecosystem (like GoDataDriven or Binx).
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof - GoDataDriven
During GoDataFest 2019, Rens Weijers, manager data & strategy, and Peter van 't Hof, data engineer, share the story of how Vattenfall develops smart applications on Azure. Vattenfall has the ambition to transition to fossil-free living within one generation. But what about decentralized energy solutions in the Customers & Solutions business unit? Data is key to helping customers reduce their CO2 footprint, and Azure enables Vattenfall to be personal and relevant towards customers.
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019 - GoDataDriven
Every company today is talking about AI/ML, but when most companies talk about AI/ML in their transformation journey, you hear terms like Proof of Concept, Feasibility Study, Pilot, A/B Test. We are at the peak of AI's hype, but only 12% of enterprises have deployed AI in production. Google aims to make big data processing available for everyone; the possibilities of BigQuery ML are endless: marketing, retail, industrial and IoT, media, gaming, and so forth.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Notes on adjusting primitives for graph algorithms like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation; the experiments below benchmark the underlying map and reduce primitives.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another.
Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
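To make the "skip converged vertices" idea concrete, here is a naive power-iteration sketch in Python (names and structure are assumptions for illustration; this is not the STICD implementation, and it assumes a graph with no dangling nodes):

import numpy as np

def pagerank_skip_converged(out_links, d=0.85, eps=1e-6, max_iter=100):
    # out_links[u] lists u's out-neighbours; every vertex is assumed
    # to have at least one out-link (no dangling-node handling here).
    n = len(out_links)
    in_links = [[] for _ in range(n)]
    for u, outs in enumerate(out_links):
        for v in outs:
            in_links[v].append(u)
    out_deg = np.array([len(o) for o in out_links], dtype=float)
    rank = np.full(n, 1.0 / n)
    active = np.ones(n, dtype=bool)  # vertices still being updated
    for _ in range(max_iter):
        new_rank = rank.copy()
        for v in range(n):
            if not active[v]:
                continue  # skip already-converged vertices
            s = sum(rank[u] / out_deg[u] for u in in_links[v])
            new_rank[v] = (1 - d) / n + d * s
        # Freeze vertices whose rank barely moved. This is a heuristic:
        # a frozen vertex can go stale if its in-neighbours keep changing.
        active = np.abs(new_rank - rank) > eps
        rank = new_rank
        if not active.any():
            break
    return rank

# e.g. a 3-cycle: every vertex converges to rank 1/3
print(pagerank_skip_converged([[1], [2], [0]]))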
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and it leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
4. Real-time, data driven app?
•No store and retrieve;
•Store, {transform, enrich, analyse} and retrieve;
•Real-time: retrieve is not a batch process;
•App: something your mother could use:
SELECT attendees
FROM NoSQLMatters
WHERE password = '1234';
10. Is it Big Data?
Everybody talks about it
Nobody knows how to do it
Everyone thinks everyone else is doing it, so everyone claims they're doing it…
Dan Ariely
11. Is it Big Data?
•Raw logs are in the order of 40TB;
•We use Hadoop for storing, enriching and pre-processing.
19. Data Example

date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2
20. helper.py example

def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]  # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()  # select sum(hits) from …
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
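A quick sanity check of how this helper behaves, using a toy frame mirroring the columns of the data example from slide 19 (the call itself is just an illustration, not part of the original deck):

import pandas as pd

data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})

print(get_statistics(data, sbi=1))
# -> {'sbi': 1, 'total': 80, 'percentage': (80 - 57) / 57 ≈ 0.40}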
21. helper.py example

def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    return df_sbi
22. Who has my data?
•The first iteration was a (pre-)POC, with less data (3GB vs 500GB);
•Time constraints;
•Oops: everything is a pandas df!
23. Advantage of “everything is a df”
Pro:
•Fast!!
•Use what you know
•NO DBA’s!
•We all love CSV’s!
Contra:
•Doesn’t scale;
•Huge startup time;
•NO DBA’s!
•We all hate CSV’s!
24. If you want to go down this path
•Set the dataframe index wisely;
•Align the data to the index:
    source_data.sort_index(inplace=True)
•Beware of modifications of the original dataframe!
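For instance, a minimal sketch of that advice on the toy frame from slide 19 (the names here, like source_data, are assumptions taken from the slide):

import pandas as pd

source_data = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-08", "2013-01-01"]),
    "postcode": ["1234AB", "1234AB", "5555ZB"],
    "hits": [35, 45, 2],
})

# Index on the columns your queries filter by most often...
source_data.set_index(["date", "postcode"], inplace=True)

# ...and align (sort) the data to that index, so lookups become cheap
# index slices instead of full scans. Careful: inplace=True modifies
# the original dataframe.
source_data.sort_index(inplace=True)

fast_lookup = source_data.loc[(pd.Timestamp("2013-01-01"), "1234AB")]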
25. If you want to go down this path
"The reason pandas is faster is because I came up with a better algorithm"
31. Issues?!
•With a radius of 10 km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
•Index on date and postcode, but single queries still ran for more than 20 minutes.

SELECT * FROM datapoints
WHERE
    date IN date_array
    AND postcode IN postcode_array;
32. Postgres + PostGIS (2.x)
PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries:

SELECT *
FROM datapoints
WHERE ST_DWithin(location,  -- assuming a geography column named location
                 ST_MakePoint(lon, lat)::geography,
                 1500)
  AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5 km
-- of (lon, lat), on imaginary dates
34. How we solved it
1. Align data on disk by date;
2. Use the temporary table trick;
3. Lose precision: 1234AB → 1234

CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
-- TEXT rather than STRING, which PostgreSQL does not have
INSERT INTO tmp (postcodes) VALUES postcode_array;

SELECT * FROM tmp
JOIN datapoints d
  ON d.postcode = tmp.postcodes
WHERE
  d.dt IN dates_array;
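In application code the trick might look roughly like this (a sketch using psycopg2; the connection string is hypothetical, and the datapoints table and column names are taken from the slides):

import psycopg2
from datetime import date

postcodes = ["1234", "2345"]  # truncated postcodes (precision dropped on purpose)
dates = [date(2013, 1, 1), date(2013, 1, 8)]

conn = psycopg2.connect("dbname=demo")  # hypothetical connection
with conn, conn.cursor() as cur:
    # Load the (possibly huge) postcode list into a temp table once...
    cur.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
    cur.executemany("INSERT INTO tmp (postcodes) VALUES (%s)",
                    [(p,) for p in postcodes])
    # ...then join against it instead of a giant IN (...) list.
    cur.execute("""
        SELECT d.*
        FROM tmp
        JOIN datapoints d ON d.postcode = tmp.postcodes
        WHERE d.dt = ANY(%s)
    """, (dates,))
    rows = cur.fetchall()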
35. Take home messages
1. Geospatial problems are "hard" and can kill your queries;
2. Not everybody has infinite resources: be smart and KISS!
3. SQL or NoSQL? (Size, schema)
36. GoDataDriven
We’re hiring / Questions? / Thank you!
@gglanzani
giovannilanzani@godatadriven.com
Giovanni Lanzani
Data Whisperer