GPUs should complement, not replace, the Hadoop ecosystem for big data workloads. Replacing the entire big data stack would be too costly. The presenter believes GPUs are best suited for accelerated computation and a few other use cases to gain an initial foothold in the market. Existing Python interfaces to machine learning frameworks rely too heavily on network communication and serialization, introducing significant overhead. Nd4j and Jumpy provide alternatives that use direct C++ interfaces and pointers for lower latency between Python and deep learning operations on CPU and GPU.
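The difference between a serialization-based bridge and a pointer-based one can be illustrated with a minimal sketch using only the Python standard library. This is not Jumpy's or Nd4j's actual API; the function names `roundtrip_by_serialization` and `view_by_pointer` are hypothetical, and `pickle` merely stands in for whatever wire format a network-based interface might use.

```python
import array
import ctypes
import pickle


def roundtrip_by_serialization(values):
    """Move the data the way a socket-style bridge would: encode, then decode."""
    payload = pickle.dumps(values)   # first full copy, into a byte string
    return pickle.loads(payload)     # second full copy, on the receiving side


def view_by_pointer(values):
    """Wrap the existing buffer without copying, as a native binding could."""
    address, length = values.buffer_info()   # raw address and element count
    c_array_type = ctypes.c_double * length  # must match the 'd' typecode below
    return c_array_type.from_address(address)


data = array.array("d", [1.0, 2.0, 3.0])
copied = roundtrip_by_serialization(data)
shared = view_by_pointer(data)

# Writing through the pointer view changes the original buffer;
# the serialized copy is unaffected.
shared[0] = 42.0
```

The pointer view shares memory with `data`, so no bytes are moved between the two sides, while the serialized round trip produces an independent copy at the cost of encoding and decoding the whole buffer.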
A recent presentation on Deeplearning4j's new features, as well as some underused features of the framework such as Arbiter, DataVec's transform process, and libnd4j.
Self driving computers active learning workflows with human interpretable ve... - Adam Gibson
Human in the loop learning workflows leveraging deep learning to group and cluster data. Also, techniques for accounting for machine learning failures.
Anomaly Detection and Automatic Labeling with Deep Learning - Adam Gibson
Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatic labeling (and the pitfalls of doing so), and discover how you can deploy these techniques in your organization.
Deploying signature verification with deep learning - Adam Gibson
The presentation covered building a signature verification system and deploying it to production, including resource usage and how the model was selected.
Given at a meetup held in Tokyo with Deep Learning Otemachi.
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ... - Databricks
Deep learning has become ubiquitous with the abundance of data and the commoditization of compute and storage. Pre-trained models are readily available for many use cases. Distributed inference has many applications, such as pre-computing results offline and backfilling historic data with predictions from state-of-the-art models. Inference on large-scale datasets comes with many of the challenges prevalent in distributed data processing.
Attendees will learn how to efficiently run deep learning prediction on large data sets, leveraging Apache Spark and Apache MXNet (incubating).
In this session, we’ll cover core deep learning concepts such as:
* Types of learning: a) supervised learning, b) unsupervised learning, c) active learning, d) reinforcement learning
* Supervised learning types: classification, regression, image classification
* Types of neural networks: feed-forward networks, CNNs, RNNs, GANs
* The Apache MXNet (incubating) deep learning framework and its concepts, i.e., NDArray, the Symbolic and Module APIs, and the Gluon APIs
* Distributed inference using Apache MXNet and Apache Spark on Amazon EMR
In this section, I will cover some of the use cases of distributed inference and the challenges associated with running it.
CI/CD for Machine Learning with Daniel Kobran - Databricks
What we call the public cloud was developed primarily to manage and deploy web servers. The target audience for these products is DevOps. While this is a massive and exciting market, the world of data science and deep learning is very different, and possibly even bigger. Unfortunately, the tools available today are not designed for this new audience, and the cloud needs to evolve. This talk covers what the next 10 years of cloud computing will look like.
Strata San Jose 2016: Scalable Ensemble Learning with H2O - Sri Ambati
Erin LeDell's presentation on Scalable Ensemble Learning with H2O at Strata + Hadoop World San Jose, 03.29.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Stitch Fix aspires to help you find the style that you will love. Data, the backbone of the business, is used to help with styling recommendations, demand modeling, user acquisition, and merchandise planning and also to influence business decisions throughout the organization. These decisions are backed by algorithms and data collected and interpreted based on client preferences. Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.
Apache Spark plays an important role in Stitch Fix’s data platform, and the company’s data scientists use Spark for their ETL and Presto for their ad hoc queries. The goal for the team running the compute infrastructure is to understand the data scientists and make their lives easier, particularly in terms of the usability of Spark, by building tools that expedite the process of getting started with Spark and transitioning from an ad hoc to a production workflow. The compute infrastructure is a part of the data platform that is responsible for all the needs of data scientists at Stitch Fix.
Neelesh shares Stitch Fix’s journey, exploring its ad hoc and production infrastructure and detailing its in-house tools and how they work in synergy with open source frameworks in a cloud environment. Neelesh also discusses the additional improvements to the infrastructure that help persist information for future use and optimization and explains how the implementation of Amazon’s EMR FS has helped make it easier to read from the S3 source.
I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general and of Spark jobs in particular. It's easier to be systematic about this if you think in terms of "What's the effective network bandwidth we're getting?" instead of "How fast does this particular job run?" In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill, you should almost certainly run on a single node (but your job may take six months to finish). You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead of 60 minutes while doubling your AWS bill, would you do it? No? Then what exactly is the metric that you should optimize?
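One way to make the 55-versus-60-minute question concrete is to price each minute saved. The sketch below is only an illustration: the dollar amounts are hypothetical, the helper name `cost_per_minute_saved` is invented here, and the talk itself does not say which metric it recommends.

```python
def cost_per_minute_saved(base_minutes, base_cost, fast_minutes, fast_cost):
    """Extra dollars paid for each minute of runtime removed."""
    saved = base_minutes - fast_minutes
    if saved <= 0:
        raise ValueError("the faster configuration must finish sooner")
    return (fast_cost - base_cost) / saved


# Hypothetical bill: the 60-minute run costs $10; doubling the bill buys 5 minutes.
price = cost_per_minute_saved(base_minutes=60, base_cost=10.0,
                              fast_minutes=55, fast_cost=20.0)
```

Comparing that price with what a minute of waiting is actually worth to you turns "as fast as possible" into an explicit trade-off.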
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ... - Data Con LA
During my time working on attribution and ingest systems, I've encountered several different approaches to solving the simple question: "How do I get data from A to B". In this session, I'd like to share some of the problems I've encountered and how to effectively solve them.
Justin Basilico, Research/Engineering Manager at Netflix, at MLconf SF - 11/1... - MLconf
Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be difficult in terms of both the algorithmic and the engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support both production and experimentation, and how to test machine learning systems.
In the slide deck, we describe how graph databases are used at Netflix. Graph databases can be faster than relational databases for deeply-connected data - a strength of the underlying model. We have used JanusGraph on top of Cassandra. Both technologies are Open Source.
Making the big data ecosystem work together with Python & Apache Arrow, Apach... - Holden Karau
Slides from PyData London exploring how the big data ecosystem (currently) works together as well as how different parts of the ecosystem work with Python. Proof-of-concept examples are provided using nltk & spacy with Spark. Then we look to the future and how we can improve.
Spark Summit EU 2015: Lessons from 300+ production users - Databricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use cases, from their support tickets and forum posts. Having seen so many different workflows and applications, we find that discernible patterns emerge in the common performance and scalability issues our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
Big Data Beyond the JVM - Strata San Jose 2018 - Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to use Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
AWS Big Data Demystified #1: Big data architecture lessons learned - Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of big data technologies, and which were selected and which disregarded in our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
The Facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. In parallel, however, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data-savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and Tensorflow for hyperparameter tuning
* Leveraging Spark and Tensorflow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for combining Spark and deep learning, with and without GPUs
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
An introduction to Netty, a powerful framework for developing networking applications.
This is supposed to be followed as hands-on training, as the exercises on the slides imply, but it can also be used as introductory guidance.
Apache Spark on Hadoop YARN Resource Manager - haridasnss
How to configure Spark on an Apache Hadoop environment, and why you might need that rather than the standalone cluster manager.
The slides also include a Docker-based demo for trying out Hadoop and Spark on your own laptop. See the demo code and other documentation here: https://github.com/haridas/hadoop-env
Big Data in 200 km/h | AWS Big Data Demystified #1.3 - Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first, I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article: it covers only a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
* How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC?
* Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
* How to handle streaming?
* How to manage costs?
* Performance tips?
* Security tips?
* Cloud best practices tips?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Hana Lee compares some popular options for data engineering work — including Python and Rust, Datafusion and Pandas — to explore the tradeoffs and help determine the ideal stack for your data engineering needs.
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal... - Big Data Montreal
Sharing the experience of deploying a Hadoop cluster in production from scratch on real hardware. A brief overview of the Hadoop stack and its components, major deployment and configuration challenges, and performance and application tuning experience. Some “war stories” about the issues we have faced in operation, and the benefits of a DevOps approach for running Hadoop apps.
Similar to Strata Beijing 2017: Jumpy, a python interface for nd4j (20)
This talk was on deep learning use cases outside of computer vision. It also covered larger-scale patterns of what good deep learning use cases typically look like. We end with an explanation of anomaly detection and various kinds of anomaly use cases.
Distributed deep rl on spark strata singapore - Adam Gibson
This talk briefly covers deep reinforcement learning on Spark and the benefits of using large-scale commodity compute with GPUs, both for ease of running simulations and for distributed training in use cases that aren't games, such as network intrusion and risk. This talk also briefly mentions RL4J and our work with OpenAI Gym.
Deep learning in production with the best - Adam Gibson
Getting deep learning adopted at your company. The current landscape of academia vs industry. Presentation at AI with the best (online conference):
http://ai.withthebest.com/
Strata Beijing - Deep Learning in Production on Spark - Adam Gibson
Recent talk at Strata Beijing, half in English and half in Chinese, covering use cases of deep learning, deep learning in production, and the different components of Deeplearning4j.
Gave a talk at:
www.meetup.com/SF-Bayarea-Machine-Learning/events/221739934/
Covers basic architecture of a scientific lib and my take on it with nd4j.
These slides accompanied a demo of Deeplearning4j at the SF Data Mining Meetup hosted by Trulia.
http://www.meetup.com/Data-Mining/events/212445872/
Deep learning is useful for detecting and identifying similarities to augment search and text analytics; predicting customer lifetime value and churn; and recognizing faces and voices.
Deeplearning4j is an infinitely scalable deep-learning architecture suitable for Hadoop and other big-data structures. It includes a distributed deep-learning framework and a normal deep-learning framework; i.e. it runs on a single thread as well. Training takes place in the cluster, which means it can process massive amounts of data. Nets are trained in parallel via iterative reduce, and they are equally compatible with Java, Scala and Clojure. The distributed deep-learning framework is made for data input and neural net training at scale, and its output should be highly accurate predictive models.
The framework's neural nets include restricted Boltzmann machines, deep-belief networks, deep autoencoders, convolutional nets and recursive neural tensor networks.
Finally, Deeplearning4j integrates with GPUs. A stable version was released in October.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
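The idea of reducing work per iteration by skipping settled vertices can be sketched in a few lines. This is not the STICD algorithm; it is a minimal, conservative stand-in that recomputes a vertex only when one of its in-neighbours changed by at least `tol` in the previous pass. It assumes there are no dangling nodes and that every vertex appears as a key in the adjacency dictionary. The function name `pagerank_skip_unchanged` and the example graph are hypothetical.

```python
def pagerank_skip_unchanged(out_links, damping=0.85, tol=1e-10, max_iter=500):
    """Power-iteration PageRank that only recomputes vertices with changed inputs.

    Assumes every vertex is a key of `out_links` and has at least one out-link.
    """
    vertices = list(out_links)
    n = len(vertices)

    # Build the reverse adjacency once so each update only reads in-links.
    in_links = {v: [] for v in vertices}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    ranks = {v: 1.0 / n for v in vertices}
    dirty = set(vertices)   # vertices whose inputs may have changed
    iterations = 0
    updates = 0             # number of per-vertex rank computations performed

    while dirty and iterations < max_iter:
        iterations += 1
        new_ranks = dict(ranks)
        next_dirty = set()
        for v in dirty:
            incoming = sum(ranks[u] / len(out_links[u]) for u in in_links[v])
            new_ranks[v] = (1.0 - damping) / n + damping * incoming
            updates += 1
            if abs(new_ranks[v] - ranks[v]) >= tol:
                # Only a real change forces the out-neighbours to be revisited.
                next_dirty.update(out_links[v])
        ranks = new_ranks
        dirty = next_dirty

    return ranks, iterations, updates


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, iterations, updates = pagerank_skip_unchanged(graph)
```

Freezing a vertex permanently the first time its rank stops moving would be cheaper still, but it can lock in a wrong value when the rank is stable only by coincidence early on; tracking changed inputs avoids that at the cost of some bookkeeping.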
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Strata Beijing 2017: Jumpy, a python interface for nd4j
2. Who are we?
This slide shows that GPUs should complement the big data stack on the Hadoop ecosystem rather than trying to
replace Hadoop outright. Wholesale replacement of the big data stack would be cost-prohibitive for many clients. We
believe the right approach is to sell GPUs for accelerated computation and a few other use cases. That’s our
beachhead. (Obviously, the widening functionality of the Volta will change the GPU ecosystem.)
Founded 2014
Distributed worldwide
Lots of activity in China
4. Most JVM Python interfaces
● Network based. Require a gateway and Py4J
● Tons of overhead. Often a bottleneck with real Spark
jobs
● Place a focus on “pushing logic down to Scala”
● Don’t interop well with the existing Python ecosystem
● Often have API compatibility issues
● “Good enough” for basic use cases despite overhead
5. Basic facts about overhead
● In-depth paper: https://arxiv.org/pdf/1612.01437.pdf
● Python vs Scala: Python is 15x slower
● Much of this is due to network traffic
● Serialization is another big problem
● Imagine saving objects every time you run compute.
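To make the serialization point concrete, here is a minimal sketch using only the Python standard library. The `compute_via_gateway` function is a hypothetical stand-in for a network-based gateway call (it is not Py4J or Spark code): it serializes the arguments and deserializes them again before doing the same work as the direct call.

```python
import pickle


def compute(values):
    """Stand-in for a numeric kernel: sum of squares."""
    return sum(v * v for v in values)


def compute_via_gateway(values):
    """Hypothetical gateway call: the arguments are serialized and
    deserialized before the kernel runs (no real network is involved)."""
    payload = pickle.dumps(values)    # serialize on the Python side
    received = pickle.loads(payload)  # deserialize on the "remote" side
    return compute(received), len(payload)


data = [float(i) for i in range(100_000)]
direct = compute(data)
via_gateway, payload_bytes = compute_via_gateway(data)

# Same answer, but the gateway path also built and parsed a full copy of the data.
print(direct == via_gateway, payload_bytes)
```

Both paths return the same result. The gateway path additionally builds and parses a byte payload at least as large as the raw data on every call, which is the "saving objects every time you run compute" cost described above. Wrapping each call in `timeit` shows the difference in wall-clock time on a given machine.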
6. Distributed Deep Learning bottlenecks
● Network overhead from parameter servers
● Data movement between CPU and GPU
● Buffer allocation for compute
● Data loading and input creation (creating tensors
from data)
7. Linear Algebra in Python
● C based internally
● Python is just an interface
● Libraries tend to interop with numpy pointers directly
● Support CPU and GPU
● For DL, often varied engines (MPI, gRPC, ...)
● Often extended in C
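The "Python is just an interface" idea can be sketched with the standard library alone. The example below uses `array.array` as an assumed stand-in for a tensor's C-backed storage and `memoryview` as the zero-copy interface over it. numpy is deliberately not used, so the snippet has no dependencies.

```python
import array

# C-backed contiguous buffer of doubles (stand-in for a tensor's storage).
storage = array.array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same memory through the buffer protocol; no copy is made.
view = memoryview(storage)
tail = view[2:]  # slicing a memoryview is also zero-copy

tail[0] = 30.0                      # write through the view...
shares_memory = storage[2] == 30.0  # ...and the original storage sees it

copied = storage[2:]                # slicing the array itself does copy
copied[0] = -1.0
copy_is_independent = storage[2] == 30.0

print(shares_memory, copy_is_independent)
```

Libraries that exchange buffers this way avoid both copying and serialization. The Python objects are thin handles over memory that native code can read and write directly.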
8. Linear Algebra in Spark
● Based on Breeze and netlib-java (not maintained
anymore, limited to CPU)
● Most routines are Scala based
● On-heap memory (bad for latency)
● CUDA support is sparse at best
● Doesn’t conform with industry standards (Python)
● Not meant for heavy compute (hardware acceleration)
● Relies on Spark for most ops (you can’t do this with
deep learning)
9. Minor conclusions
● One of these is not like the other
● Hard to interop with the Python ecosystem
● Spark tries to be something it’s not re: linear algebra
● Spark should do data loading, not linear algebra, which is
better handled by C++ (SIMD, GPUs, ...)
● Alternatives are needed (more specialization, with a focus
on C++ with Pythonic conventions)
10. Nd4j
● Java-based API, C++ core
● Own off-heap memory management (even for GPU)
● Soon: autodiff, graph execution (graph of
operations), and sparse support
● Similar architecture to numpy (easy interop)
(http://nd4j.org/userguide)
● Works with BLAS/LAPACK
● Generally faster than numpy even from Python (as
we’ll see soon)
● It’s not Python though!
12. Jumpy: A better Python interface
● Low latency using C internally
● Interfaces nd4j <-> numpy via direct pointers
● Syntax sugar similar to numpy
● Uses pyjnius underneath (https://github.com/kivy/pyjnius)
● pyjnius starts and manages a JVM for you. Interops
via JNI and Cython
● Easy to extend
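The "direct pointers" idea can be illustrated without a JVM. The sketch below is an analogy built on the standard-library `ctypes` module, not Jumpy's actual code (which goes through pyjnius and JNI). The function `scale_in_place` is a hypothetical "native side" that receives only a raw memory address and a length, and mutates the caller's buffer with no copy and no serialization.

```python
import array
import ctypes


def scale_in_place(address, length, factor):
    """Hypothetical 'native side': sees only a raw address and a length."""
    doubles = (ctypes.c_double * length).from_address(address)
    for i in range(length):
        doubles[i] *= factor


values = array.array("d", [1.0, 2.0, 3.0])
address, length = values.buffer_info()  # (memory address, number of items)

# Only two integers cross the boundary; the data itself never moves.
scale_in_place(address, length, 2.0)
result = values.tolist()
print(result)
```

The caller must keep `values` alive and unchanged in size while the address is in use, since the receiving side has no way to know if the memory was freed or moved. Any real pointer-sharing bridge has to manage that lifetime explicitly.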
15. Conclusions and future work
● No networks! An actual path to improvement
● Reflection can be a bottleneck
● Like most useful things in Python, most of it is C!
● Plans to optimize pyjnius itself
● Can enable us to interop with other parts of Python