Deep learning algorithms have benefited greatly from the recent performance gains of GPUs. However, it has been unclear whether GPUs can speed up machine learning algorithms such as generalized linear modeling, random forests, gradient boosting machines, and clustering. H2O.ai, the leading open source AI company, is bringing the best-of-breed data science and machine learning algorithms to GPUs.
We introduce H2O4GPU, a fully featured machine learning library optimized for GPUs, with a robust Python API that is a drop-in replacement for scikit-learn. We'll demonstrate benchmarks for the most common algorithms relevant to enterprise AI and showcase the performance gains compared to running on CPUs.
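As a rough illustration of that drop-in claim, here is a minimal sketch of the scikit-learn-style usage the library advertises; the toy data and hyperparameters are ours, not from the talk:

import h2o4gpu
import numpy as np

# Tiny toy dataset; in practice you would pass a large matrix to see GPU gains.
X = np.array([[1.0, 1.0], [1.0, 4.0], [1.0, 0.0], [4.0, 4.0]])

# Same estimator interface as sklearn.cluster.KMeans, but GPU-backed.
model = h2o4gpu.KMeans(n_clusters=2, random_state=1234).fit(X)
print(model.cluster_centers_)
print(model.predict(X))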
Jon’s Bio:
https://umdphysics.umd.edu/people/faculty/current/item/337-jcm.html
Please view the video here:
Hadoop 2.0, and in particular YARN, has opened up many potential applications beyond MapReduce. This presentation explains some of the ways this happened, and what you can now do that you couldn't before. It also introduces some newer tools (Spark) and infrastructure pieces (Mesos) that enable even more efficient cluster use.
Aleksei Udatšnõi – Crunching thousands of events per second in nearly real time (NoSQLmatters)
Imagine you have a product which generates up to 10 thousand events per second, or around 1 billion events per day. This live stream of data needs to be tracked, processed, and presented to end users in a visually appealing way, and the solution needs to be integrated into a traditional web application. That is a real use case at Softonic. In this talk we will show how it was solved at Softonic, using the stack of technologies around Big Data to process and store the live stream of data and present the results to users in near real time. This real-life solution is built around the Hadoop ecosystem and includes Flume, Hive, Oozie, and Impala. We will show how to store and query such volumes of data using a NoSQL database and how to build a scalable end-user web application on a near-real-time data feed.
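For flavour, here is a minimal sketch of the kind of near-real-time query an application tier might run against Impala from Python using impyla; the host, table, and columns are hypothetical:

from impala.dbapi import connect

# Connect to an Impala daemon (hypothetical host; 21050 is Impala's default HiveServer2 port).
conn = connect(host='impala-host.example.com', port=21050)
cursor = conn.cursor()

# Aggregate the last hour of events for a dashboard (hypothetical schema with epoch timestamps).
cursor.execute("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    WHERE event_time > unix_timestamp(now()) - 3600
    GROUP BY event_type
""")
for event_type, n in cursor.fetchall():
    print(event_type, n)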
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose (Allen Day, PhD)
Architecting R into the Storm Application Development Process
~~~~~
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business-critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
In this presentation, Allen will build a bridge from basic real-time business goals to the technical design of solutions. We will take an example of a real-world use case, compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution.
Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS - FOSS4G (Rob Emanuele)
Slides from the 2017 FOSS4G Workshop "Analyzing Larger RasterData in a Jupyter Notebook with GeoPySpark on AWS"
See the repository at https://github.com/lossyrob/foss4g-2017-geopyspark-workshop
Mining and Managing Large-scale Linked Open Data (MOVING Project)
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not impose a fixed, pre-defined schema; rather, it allows the data schema to be modeled flexibly by attaching RDF types and properties to entities. Our schema-level index, SchemEX, allows searching in large-scale RDF graph data. The index can be computed efficiently, with reasonable accuracy, over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. Such an index is badly needed as the size of the LOD cloud grows rapidly. As the LOD cloud evolves, one observes frequent changes in the data, and we show that the data schema also changes in terms of the combinations of RDF types and properties. Since individual snapshots cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud, with about 100 million triples per week for more than three years.
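To make the idea of a schema-level index concrete, here is a minimal rdflib sketch that groups entities by the combination of their RDF types and properties, which is the kind of "type cluster" key SchemEX builds on; the input file is a placeholder, and this is our simplification, not the SchemEX implementation:

from collections import defaultdict
from rdflib import Graph, RDF

g = Graph()
g.parse("data.nt", format="nt")  # placeholder N-Triples file

schema_index = defaultdict(set)
for subject in set(g.subjects()):
    types = frozenset(g.objects(subject, RDF.type))
    properties = frozenset(p for p in g.predicates(subject, None) if p != RDF.type)
    # Entities sharing the same (types, properties) combination land in one bucket.
    schema_index[(types, properties)].add(subject)

print(len(schema_index), "distinct schema combinations")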
CERN, the vast underground laboratory under the mountains bordering France and Switzerland, has taken on some huge challenges – but few could be larger than storing all the data it has picked up.
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko... (Spark Summit)
The talk will present an MPI-based extension of the Spark platform developed in the context of light source facilities. The background and rationale of this extension are described in the attached paper “Bringing the HPC reconstruction algorithms to Big Data platforms” [1], which was presented at the New York Scientific Data Summit (NYSDS), August 14-17, 2016 (talk: https://www.bnl.gov/nysds16/files/pdf/talks/NYSDS16%20Malitsky.pdf). Specifically, the paper highlighted a gap between two modern driving forces of the scientific discovery process: HPC and Big Data technologies. As a result, it proposed to extend the Spark platform with inter-worker communication to support scientific parallel applications. The approach was illustrated in the context of the Spark-based deployment of the SHARP MPI/GPU ptychographic solver. Aside from its practical value, this application represents a reference use case that captures the major technical aspects of other reconstruction tasks. In the NYSDS'16 paper, the implemented approach followed the CaffeOnSpark RDMA peer-to-peer model and augmented it with an RDMA address exchange server. By the Spark Summit, we plan to advance this direction further with a generic Spark-MPI solution based on the Hydra process management framework, supporting the two major MPI implementations, MPICH and MVAPICH.
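The inter-worker communication Spark lacks is exactly what MPI collectives provide. As a point of reference, here is a minimal mpi4py sketch of an allreduce, the pattern Spark-MPI aims to make available between workers; this is plain MPI for illustration, not the Spark-MPI API:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(4, rank, dtype="float64")  # each worker's partial result
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)   # combine across workers without a driver round-trip
print(rank, total)

Run with, e.g., mpiexec -n 4 python allreduce_demo.py.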
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics (Data Con LA)
This talk provides an overview of the open source Storm system for processing Big Data in real time. The talk starts with an overview of the technology, including key components: Nimbus, Zookeeper, Topology, Tuple, and Trident. It looks at integration with Hadoop through YARN and recent improvements. The presentation then dives into the complex Big Data architectures in which Storm can be integrated. The result is a compelling stack of technologies including integrated Hadoop clusters, MPP, and NoSQL databases.
After this, we look at example use cases for Storm: real-time advertising statistics, updating a machine-learned model for content popularity predictions, and financial compliance monitoring.
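To show the shape of such a topology component, here is a minimal Python sketch of a Storm bolt using the streamparse library, maintaining running ad-impression counts in the spirit of the advertising-statistics use case; the stream fields and topology wiring are our assumptions, not taken from the talk:

from collections import Counter
from streamparse import Bolt

class AdStatsBolt(Bolt):
    def initialize(self, storm_conf, context):
        self.counts = Counter()

    def process(self, tup):
        campaign_id = tup.values[0]  # assumes the upstream spout emits (campaign_id,)
        self.counts[campaign_id] += 1
        # Emit the updated running count downstream.
        self.emit([campaign_id, self.counts[campaign_id]])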
Dask Tutorial at PyConDE / PyData Karlsruhe 2018. These were the introductory slides, which mainly contain the link to Matthew Rocklin's Dask workshop at PyData NYC 2018, on which this workshop was based.
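For readers new to Dask, a small self-contained example in the spirit of the tutorial: build a lazy task graph over a chunked array, then compute it in parallel (the array and operation are illustrative):

import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))  # 100 chunks
result = (x + x.T).mean(axis=0)  # lazy; nothing is computed yet
print(result.compute())          # executes the task graph across threads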
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low-latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl's extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
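To give a feel for what a PageRank step looks like as a MapReduce job, here is a schematic mrjob sketch of a single iteration; the input format, damping factor, and the omission of adjacency re-emission are our simplifications, not CommonCrawl's actual implementation:

from mrjob.job import MRJob

DAMPING = 0.85

class MRPageRankStep(MRJob):
    # Expected input line: "url rank outlink1,outlink2,..."
    def mapper(self, _, line):
        url, rank, outlinks = line.split()
        outlinks = outlinks.split(",")
        for link in outlinks:
            yield link, float(rank) / len(outlinks)  # share rank across outlinks
        yield url, 0.0  # keep the node alive even if nothing links to it

    def reducer(self, url, contributions):
        yield url, (1 - DAMPING) + DAMPING * sum(contributions)

if __name__ == "__main__":
    MRPageRankStep.run()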
The next generation of the Montage image mosaic engine (G. Bruce Berriman)
Presentation given by Bruce Berriman at the Astronomical Data Analysis Software & Systems XXV (ADASS XXV) Conference, Sydney, Australia, October 29, 2015.
Authors: G. B. Berriman, J.C. Good, B. Rusholme, T. Robitaille.
Real-Time Big Data at In-Memory Speed, Using Storm (Nati Shalom)
Storm, a popular framework from Twitter, is used for real-time event processing. The challenge presented is how to manage the state of your real-time data processing at all times. In addition, you need Storm to integrate with your batch processing system (such as Hadoop) in a consistent manner.
This session will demonstrate how to integrate Storm with an in-memory database/grid, and explore various strategies for integrating the data grid with Hadoop and Cassandra seamlessly. By achieving smooth integration with consistent management, you will be able to easily manage all the tiers of your Big Data stack in a consistent and effective way.
See more at: http://nosql2013.dataversity.net/sessionPop.cfm?confid=74&proposalid=5526
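As one concrete piece of such a stack, here is a minimal sketch of persisting stream state to Cassandra from a processing tier using the Python cassandra-driver; the keyspace, table, and schema are hypothetical:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("streaming")  # assumes a keyspace named "streaming" exists

# Prepared statement for repeated, low-latency writes from the stream.
insert = session.prepare(
    "INSERT INTO event_counts (event_type, window_start, count) VALUES (?, ?, ?)"
)

def flush_window(event_type, window_start, count):
    session.execute(insert, (event_type, window_start, count))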
Using BigBench to compare Hive and Spark (short version) (Nicolas Poggi)
BigBench is the brand new standard for benchmarking and testing Big Data systems. This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in their respective versions 1 and 2, under different configurations. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.
Space debris consists of defunct objects in space, including old space vehicles and fragments from collisions. Space debris can cause great damage to functioning spacecraft and satellites, so detecting debris and predicting its orbital path are essential. The talk shows a Python-based infrastructure for storing space debris data from sensors and for high-throughput processing of that data.
PyData Seattle (July 26, 2015)
http://seattle.pydata.org/schedule/presentation/35/
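As a small taste of the orbit-prediction side, here is a minimal sketch with the sgp4 package: propagate one object from a two-line element set to a position and velocity at a given time. The TLE below (the ISS) is only a stand-in; a debris-tracking pipeline would do this at scale for sensor-derived element sets:

from sgp4.api import Satrec, jday

line1 = "1 25544U 98067A   19343.69339541  .00001764  00000-0  40967-4 0  9997"
line2 = "2 25544  51.6440 211.2001 0007417  17.6667  85.6398 15.50103472202482"

sat = Satrec.twoline2rv(line1, line2)
jd, fr = jday(2019, 12, 9, 12, 0, 0)      # Julian date for 2019-12-09 12:00 UTC
error, position_km, velocity_kms = sat.sgp4(jd, fr)
print(error, position_km, velocity_kms)   # error == 0 means success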
This is a slide deck that I have been using to present GeoTrellis at various meetings and workshops. The information speaks to GeoTrellis pre-1.0, as of Q4 2016.
Geospatial Sensor Networks and Partitioning Data (AlexMiowski)
We use resources like weather reports and air quality measurements to navigate the world. These resources become especially important when we face extreme events like the current wildfires in the western USA. The data for the reports, predictions, and maps all start as real-time sensor networks.
In this presentation, I look at some of my research into scientific data representation on the Web and how the key mechanism is the partitioning, annotation, and naming of data representations. We’ll take a look at a few examples, including some recent work on air quality data relating to the current wildfires in the western USA. We’ll explore the central question of how geospatial sensor network data can be collected and consumed within K8s deployments.
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St... (t_ivanov)
Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of the processing engine and file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to a 60.2% improvement with ORC, while Parquet achieves up to a 7% improvement with Snappy. The exceptions are queries involving text processing, which do not benefit from compression.
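To make the format/compression pairings concrete, here is a small pyarrow sketch that writes the same table as Parquet with Snappy and as ORC with ZLIB (the study measures engine-side query performance on such files; this only shows how the files themselves are produced, and assumes a recent pyarrow with ORC write support):

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

table = pa.table({"id": list(range(1000)), "payload": ["x" * 100] * 1000})

pq.write_table(table, "data.snappy.parquet", compression="snappy")
orc.write_table(table, "data.zlib.orc", compression="zlib")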
An efficient data mining solution by integrating Spark and Cassandra (Stratio)
Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain better results than using Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is. The goal with Cassandra is to have a system that mines all the information stored in C* much more efficiently than if the information were stored in HDFS. Cassandra's data storage and Spark's data mining power: an unrivalled mix.
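A minimal PySpark sketch of reading a Cassandra table through the spark-cassandra-connector (the connector package must be on the Spark classpath); the keyspace and table names are hypothetical:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cassandra-mining")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")
    .load()
)
df.groupBy("event_type").count().show()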
Managing data analytics in a hybrid cloud (Karan Singh)
We'll talk about the changes in the industry that customers are faced with and how Red Hat Hyperconverged Infrastructure can address those challenges. Our customers are struggling not only to manage the growth of big data (structured and unstructured), but also to reap timely business insights from it using their existing data infrastructure, such as monolithic Hadoop clusters. This often pushes them toward alternative approaches that lead to disappointing results.
Hadoop Administrator online training course by Knowledgebee Trainings, covering mastery of the Hadoop cluster: planning and deployment, monitoring, performance tuning, security using Kerberos, HDFS High Availability using Quorum Journal Manager (QJM), and Oozie and HCatalog/Hive administration.
Contact : knowledgebee@beenovo.com
YARN webinar series: Using Scalding to write applications for Hadoop and YARN (Hortonworks)
This webinar focuses on introducing Scalding for developers and writing applications for Hadoop and YARN using Scalding. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... (MongoDB)
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on the high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
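A minimal pymongo sketch of that secondary-index pattern: store a small lookup document per record that points back to its location in HDFS, and index the lookup key. Field names and paths here are illustrative, not from the talk:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
idx = client.analytics.visitor_index

idx.create_index([("visitor_id", ASCENDING)])  # enables low-latency point lookups

idx.insert_one({
    "visitor_id": "v-12345",
    "location": {"city": "Portland", "country": "US"},
    "hdfs_path": "/data/omniture/2013/10/part-00042",  # where the full record lives
    "offset": 1048576,
})

print(idx.find_one({"visitor_id": "v-12345"})["hdfs_path"])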
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less... (Amazon Web Services)
The PanCancer Analysis of Whole Genomes (PCAWG) project is a large-scale, highly distributed research collaboration designed to identify common patterns of mutations across 2,800 cancer genomes. Public and private clouds were instrumental in analyzing this dataset using current best-practice containerized pipelines. This session describes the technical infrastructure built for the project, how we leveraged cloud environments to perform the “core” analysis, and the lessons learned along the way.
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME (Safe Software)
Following the popularity of “Cloud Revolution: Exploring the New Wave of Serverless Spatial Data,” we’re thrilled to announce this much-anticipated encore webinar.
In this sequel, we’ll dive deeper into the Cloud-Native realm by uncovering practical applications and FME support for these new formats, including COGs, COPC, FlatGeoBuf, GeoParquet, STAC, and ZARR.
Building on the foundation laid by industry leaders Michelle Roby of Radiant Earth and Chris Holmes of Planet in the first webinar, this second part offers an in-depth look at the real-world application and behind-the-scenes dynamics of these cutting-edge formats. We will spotlight specific use-cases and workflows, showcasing their efficiency and relevance in practical scenarios.
Discover the vast possibilities each format holds, highlighted through detailed discussions and demonstrations. Our expert speakers will dissect the key aspects and provide critical takeaways for effective use, ensuring attendees leave with a thorough understanding of how to apply these formats in their own projects.
Elevate your understanding of how FME supports these cutting-edge technologies, enhancing your ability to manage, share, and analyze spatial data. Whether you’re building on knowledge from our initial session or are new to the serverless spatial data landscape, this webinar is your gateway to mastering cloud-native formats in your workflows.
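For orientation, here is a small Python sketch of reading two of the cloud-native formats mentioned, straight from object storage: GeoParquet with geopandas and Zarr with the zarr library (both assume s3fs is installed; bucket names, keys, and variable names are placeholders):

import geopandas as gpd
import zarr

# GeoParquet: columnar vector data, readable without downloading whole files.
buildings = gpd.read_parquet("s3://example-bucket/buildings.parquet")
print(buildings.crs, len(buildings))

# Zarr: chunked n-dimensional arrays addressed directly in the object store.
store = zarr.open("s3://example-bucket/temperature.zarr", mode="r")
print(store["t2m"][0, :10, :10])  # reads only the chunks covering this slice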
Paris Spark Meetup (Feb 2015) ccarbone: SPARK Streaming vs Storm / MLLib / Ne... (Cedric CARBONE)
Presentation of the Spark technology and examples of new business use cases that real-time Big Data can address, by Cédric Carbone:
-Spark vs Hadoop MapReduce (& Hadoop v2 vs Hadoop v1)
-Spark Streaming vs Storm (see the sketch after this list)
-Machine Learning with Spark
-Business use case: NextProductToBuy
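For the "Spark Streaming vs Storm" item above, here is a minimal PySpark DStream sketch of the Spark Streaming side of the comparison: micro-batched counts over a socket source (host and port are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()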
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there's a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown by tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data (see the worked example after this list).
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
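As a self-contained taste of the basic-to-advanced progression described above, here is a runnable example using Python's built-in sqlite3: a simple aggregate, then a window function (sample data is illustrative; window functions require SQLite 3.25+):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', '2024-01', 100), ('north', '2024-02', 150),
        ('south', '2024-01', 80),  ('south', '2024-02', 120);
""")

# Basic: filtering and aggregation.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)

# Advanced: running total per region with a window function.
query = """
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
"""
for row in conn.execute(query):
    print(row)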
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives (sumAt, multiply) in sequential mode.
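For reference, here is a sequential Python sketch of the computation those primitives implement (rank-contribution accumulation and the damping update); the graph, damping factor, and tolerance are illustrative, and dead-end handling is omitted:

def pagerank(out_links, damping=0.85, tol=1e-10):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    while True:
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in out_links.items():
            share = damping * rank[v] / len(outs) if outs else 0.0
            for u in outs:
                new_rank[u] += share  # the "sumAt"-style accumulation
        if sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol:
            return new_rank
        rank = new_rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))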
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method, where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by submitting a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
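A minimal networkx sketch of the decomposition Levelwise PageRank relies on: condense the graph into its strongly connected components, then walk the component DAG level by level in topological order (the graph is illustrative; rank computation per level is omitted):

import networkx as nx

g = nx.DiGraph([(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (4, 5)])

cond = nx.condensation(g)  # DAG of strongly connected components
levels = list(nx.topological_generations(cond))

for level, comps in enumerate(levels):
    for c in comps:
        members = cond.nodes[c]["members"]  # original vertices in this SCC
        print(f"level {level}: component {sorted(members)}")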
1. Directly computing against public and research cloud object stores
OCEANS AND ATMOSPHERE
Paul Branson | UWA/CSIRO Joint Post-doc
8 May 2019
2. Pangeo on HPC
• About me
• Disclaimer/Acknowledgements
• Examples of:
• Quick setup of Pangeo for HPC (dask-jobqueue)
• Intake-thredds to AODN THREDDS server, s3fs to AODN AWS S3 bucket
• Xarray with Geoviews+Holoviews for visualisation
• Converting netCDF to Zarr
• Benchmarks of various ways to compute against AODN data directly
• Doing some science (sort of!)
3. About me
• Coastal Physical Oceanographer
• Lots of numerical modelling – results are typically dense nD-arrays in netCDF format
• PhD at UWA studying shallow island wakes at laboratory scale
• Developer of 3D PIV code for scalable analysis on Pawsey
4. Some serious dimensionality reduction
• From 9 cameras and 140TB of images
• Each instantaneous velocity field (of up to 98,000 vectors) requires approximately 4.9 million 3D FFTs (of ~150,000 points)
• 50 experiments of ~10,000 frames to a 32kb figure
Branson PM, Ghisalberti M, Ivey GN (2018) Three-dimensionality of shallow island wakes, Journal of Environmental Fluid Mechanics
Branson PM, Ghisalberti M, Ivey GN, Hopfinger EJ (2019, accepted) Cylinder wakes in shallow oscillatory flow: the coastal island wake problem, Journal of Fluid Mechanics
5. • XArray – labelled nDimensional arrays
• Dask – scaling out analysis of netCDF datafiles using dask-jobqueue on Pawsey HPC
Problems like tidal phase-aligning results of experiments, subsetting and aggregating, calculating differential quantities of vector fields, etc.
...and of course the rest of the Python data stack (Numpy, Scipy, Matplotlib).
And the final data analysis stage was made considerably easier with Pangeo.
6. Acknowledgements
Note: None of this is my work! All self taught from the openness of the Pangeo community:
• Ryan Abernathey @rabernat
• Matthew Rocklin @mrocklin
• Joe Hamman @jhamman
• Stephan Hoyer @shoyer
• Martin Durant @martindurant
• Anderson Banihirwe @andersy005
• Scott Henderson @scottyhq
Pangeo Community: pangeo.io | github.com/pangeo-data
7. Setup of dask-jobqueue on HPC
A more detailed guide is available here: http://pangeo.io/setup_guides/hpc.html
TL;DR version:
git clone https://github.com/pbranson/c3dis-2019-pangeo.git
cd c3dis-2019-pangeo
conda env create -f environment.yaml
conda activate pyAODN
cp jobqueue.yaml ~/.config/dask/
sbatch start_pangeo.sh
ssh -N -l pbranson -L 8888:z043:8888 zeus.pawsey.org.au
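For comparison, a minimal sketch of doing the same from Python with dask-jobqueue's SLURMCluster; the values mirror the jobqueue.yaml on the next slide, and the `project` keyword matches the 2019-era dask-jobqueue API (newer releases call it `account`):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=10, processes=5, memory="120GB",
    queue="workq", project="pawsey0106", walltime="0-2:00:00",
)
cluster.scale(jobs=4)      # submit 4 SLURM jobs: 20 workers in total
client = Client(cluster)   # dask computations now run on the allocation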
8. ~/.config/dask/jobqueue.yaml
distributed:
  scheduler:
    work-stealing: True
    allowed-failures: 5
  worker:
    memory:
      target: 0.6      # Avoid spilling to disk
      spill: 0.7       # Avoid spilling to disk
      pause: 0.80      # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
jobqueue:
  slurm:
    cores: 10
    memory: 120GB
    processes: 5
    queue: workq
    project: pawsey0106
    walltime: 0-2:00:00
OR
jobqueue:
  slurm:
    cores: 4
    memory: 12GB
    processes: 2
    queue: workq
    project: pawsey0106
    walltime: 0-2:00:00
9. Intake-thredds
• Since a few weeks ago we can (thanks to @andersy005, @martindurant)
10. Intake-thredds – to Xarray
(the to_dask() is a bit of a hangover that needs refactoring)
11. Access the underlying AODN AWS S3 bucket
12. And open a netCDF file directly from S3 (also works for Google)
Only working since a few weeks ago: h5netcdf==0.7.1, s3fs==0.2.1
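A minimal sketch of what slide 12 shows: open a netCDF file straight from S3 with s3fs plus xarray's h5netcdf engine. The object key is a placeholder for an AODN path:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)  # AODN data is publicly readable
with fs.open("imos-data/some/path/file.nc") as f:  # placeholder key
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)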
13. Viewing remote sensing data in the browser with Holoviews+Geoviews
16. Example converting to Zarr
Only working since a few weeks ago: h5netcdf==0.7.1, s3fs==0.2.1
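A minimal sketch of the netCDF-to-Zarr conversion shown on slide 16: write an xarray dataset to a Zarr store on S3 via an s3fs mapping. Paths, bucket, and chunking are placeholders:

import s3fs
import xarray as xr

ds = xr.open_dataset("local_copy.nc")                # placeholder source file
fs = s3fs.S3FileSystem()                             # credentials from the environment
store = s3fs.S3Map("my-bucket/dataset.zarr", s3=fs)  # placeholder bucket
ds.chunk({"time": 50}).to_zarr(store, mode="w")      # chunked, cloud-optimised layout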
17. So lets test some of these things out!
• Benchmark by calculating the monthly mean of 12 months of daily Australia-wide files (a reduction of a 365 x 10001 x 7001 element, 100GB array):
monthly_mean = ds_thredds.groupby('time.month').mean('time').compute()
1. netCDFs via THREDDS
2. netCDFs via S3 directly
3. netCDFs via Lustre filesystem
4. Zarr via Lustre
5. Zarr via S3
• Benchmarks conducted from Pawsey with 20 workers across 4 nodes
18. Results
19. Research cloud object stores
• Stay tuned!
• On AARNET – presentation yesterday by Gavin Kennedy
• CloudStor Service – Re-engineering CloudStor for Infinite Scalability
• S3 service using minio server
• Scalable service available June 2019
• On Pawsey (from a little birdie)
• S3-compliant object store procurement commencing second half of 2019
• But Pawsey/CSIRO have some phat pipes so you can work directly on AWS S3…
20. Conclusions
• Data volumes are going up exponentially
• Eventually there is insufficient storage to mirror datasets for personal use (I know of some datasets replicated 5/6 times on Pawsey)
• But bandwidth also keeps going up
• So computing against object stores from HPC seems viable… ONLY if your datasets are in a cloud-optimised format
• So in the context of FAIR data, it seems that the de facto standard of netCDF (HDF) as a data format fails the Interoperability criterion in practice
• https://github.com/pbranson/c3dis-2019-pangeo.git
21. Thank you
OCEANS AND ATMOSPHERE
Paul Branson | UWA/CSIRO Post-Doc
e: paul.branson@csiro.au