Implementation details of Sparksee's graph database: learn how bitmaps store graph information and how this results in a lightweight and high-performance solution.
HUG Italy meet-up with Tugdual Grall, MapR Technical EvangelistSpagoWorld
The presentation supported the speech "Drilling into Data with Apache Drill" by Tugdual Grall (Technical Evangelist, MapR Technologies Inc.) at the HUG Italy meet-up supported by Engineering Group's SpagoBI Labs, which took place in Milan, Italy on March 17th, 2016. Read more: http://bit.ly/1UydNuz
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integrationCesare Cugnasco
Data visualization can be a tricky problem, even more so when the dataset consists of several billion 3-dimensional particles moving over time. The talk will focus on some simple indexing and data-thinning techniques and how (and how not) to implement them with Cassandra and Spark.
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples presented in this talk.
Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
"Real World Use Cases: Hadoop and NoSQL in Production" by Tugdual Grall.
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples presented in this talk.
PayPal provides an online money-transfer network. Each payment flow connects senders and receivers into a giant network where each sender/receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed based on the characteristics of the involved sender, receiver, and transaction. In this talk, we will describe a novel network inference approach, built with Apache Giraph, that calculates a transaction risk score which also incorporates the risk profiles of neighboring senders and receivers. The approach reveals additional risk insights not possible with the traditional method. We leverage Hadoop to support a graph computation involving hundreds of millions of nodes and edges.
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
In this session, you will learn how CERN easily applied end-to-end deep learning and analytics pipelines on Apache Spark at scale for High Energy Physics using BigDL and Analytics Zoo open source software running on Intel Xeon-based distributed clusters.
Technical details and development learnings will be shared using an example of topology classification to improve real-time event selection at the Large Hadron Collider experiments. The classifier has demonstrated very good performance figures for efficiency, while also reducing the false positive rate compared to the existing methods. It could be used as a filter to improve the online event selection infrastructure of the LHC experiments, where one could benefit from a more flexible and inclusive selection strategy while reducing the amount of downstream resources wasted in processing false positives.
This is part of CERN’s research on applying Deep Learning and Analytics using open source and industry standard technologies as an alternative to the existing customized rule based methods. We show how we could quickly build and implement distributed deep learning solutions and data pipelines at scale on Apache Spark using Analytics Zoo and BigDL, which are open source frameworks unifying Analytics and AI on Spark with easy to use APIs and development interfaces seamlessly integrated with Big Data Platforms.
Big Data Meets Learning Science: Keynote by Al EssaSpark Summit
How do we learn and how can we learn better? Educational technology is undergoing a revolution fueled by learning science and data science. The promise is to make a high-quality personalized education accessible and affordable by all. In this presentation Alfred will describe how Apache Spark and Databricks are at the center of the innovation pipeline at McGraw Hill for developing next-generation learner models and algorithms in support of millions of learners and instructors worldwide.
Decoding New Trends in Cloud Big Data
2018-05-16 @ iThome Cloud Summit 2018
Cloud computing, big data, the Internet of Things, and artificial intelligence have been appearing as hot topics in the media since 2008. Looking back on a decade of Apache Hadoop adoption in Taiwan, this talk explains how these four topics relate to one another, examines the market demand driving the Big Data Stack on the Cloud, and closes with an update on the progress of the Big Data Stack on Kubernetes.
Fast Cars, Big Data - How Streaming Can Help Formula 1Tugdual Grall
Modern cars produce data. Lots of data. And Formula 1 cars produce more than their share. I will present a working demonstration of how modern data streaming can be applied to the data acquisition and analysis problem posed by modern motorsports.
Instead of bringing multiple Formula 1 cars to the talk, I will show how we instrumented a high fidelity physics-based automotive simulator to produce realistic data from simulated cars running on the Spa-Francorchamps track. We move data from the cars, to the pits, to the engineers back at HQ.
The result is near-real-time visualization and comparison of performance, and a practical exposition of how to move data using messaging systems like Kafka, process it in real time with Apache Spark, and then analyse it using SQL with Apache Drill.
Code available here: https://github.com/mapr-demos/racing-time-series
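As a rough illustration of the pipeline described above, here is a minimal PySpark Structured Streaming sketch that consumes telemetry from Kafka and aggregates it per car; the topic name, broker address, and message schema are illustrative assumptions and are not taken from the demo repository.

# Hypothetical sketch: consume simulated telemetry from Kafka and compute
# per-car rolling averages. Topic, brokers, and schema are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("telemetry-demo").getOrCreate()

schema = StructType([
    StructField("car_id", StringType()),
    StructField("speed_kmh", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "telemetry")          # assumed topic name
       .load())

telemetry = raw.select(from_json(col("value").cast("string"), schema).alias("t")).select("t.*")

avg_speed = (telemetry
             .withWatermark("event_time", "10 seconds")
             .groupBy(window("event_time", "5 seconds"), "car_id")
             .avg("speed_kmh"))

query = avg_speed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()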
How Deep Learning Will Make Us More Human Again
While deep learning is taking over the AI space, most of us are struggling to keep up with the pace of innovation. Arno Candel shares success stories and challenges in training and deploying state-of-the-art machine learning models on real-world datasets. He will also share his insights into what the future of machine learning and deep learning might look like, and how to best prepare for it.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Spark Summit
In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
Enabling Biobank-Scale Genomic Processing with Spark SQLDatabricks
With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while being flexible to ad-hoc analysis, Databricks and Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
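A hedged PySpark sketch of that flow follows; the input path, the columns, and the annotate_partition helper are hypothetical stand-ins for the project's actual readers and genomic tools.

# Illustrative sketch of the described flow: ingest flat files into a DataFrame,
# prepare it with Spark SQL primitives, run a per-partition step, write to Delta.
# Paths, columns, and annotate_partition are hypothetical; Delta Lake is assumed
# to be available on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("variant-pipeline").getOrCreate()

variants = spark.read.csv("/data/variants.tsv", sep="\t", header=True)   # assumed flat-file input

prepared = (variants
            .filter(col("qual").cast("double") > 30)          # quality-control filter
            .select("contig", "start", "ref", "alt"))

def annotate_partition(rows):
    # Placeholder for invoking an existing genomic analysis tool per partition.
    for r in rows:
        yield (r["contig"], int(r["start"]), r["ref"], r["alt"], "predicted_effect")

annotated = prepared.rdd.mapPartitions(annotate_partition).toDF(
    ["contig", "start", "ref", "alt", "effect"])

annotated.write.format("delta").mode("overwrite").save("/delta/annotated_variants")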
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions ArchitectSpagoWorld
The presentation supported the speech "Think differently – Stream-based Microservice Architecture for Next-Generation Applications" by Fabian Wilckens (EMEA Solutions Architect, MapR Technologies Inc.) at the HUG Italy meet-up supported by Engineering Group's SpagoBI Labs, which took place in Milan, Italy on March 17th, 2016. Read more: http://bit.ly/1UydNuz
Spark and MapR Streams: A Motivating ExampleIan Downard
Businesses are discovering the untapped potential of large datasets and data streams through the use of technologies for big data processing and storage. By leveraging these assets they’re creating a new generation of applications that derive value from data they used to throw away. In this presentation Ian Downard shows how to build operational environments for these types of applications with the MapR Converged Data Platform, and he describes examples of next-generation applications that use Java APIs for MapR Streams, Apache Spark, Apache Hive, and MapR-DB. He shows how these technologies can be used to join and transform unbounded datasets to find signals and derive new data streams for a financial scenario involving real-time algorithmic trading and historical analysis using SQL. He also discusses how MapR enables you to run real-time data applications with the speed, reliability, and security you need for a production environment.
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine; in other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
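A minimal DStream sketch of the micro-batch model, using a socket text stream as a stand-in log source; the two-second batch interval and host/port are illustrative.

# Each 2-second interval of log lines becomes a small RDD processed with
# ordinary Spark code. Source host/port and batch interval are illustrative.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="log-analytics")
ssc = StreamingContext(sc, batchDuration=2)   # micro-batches every 2 seconds

lines = ssc.socketTextStream("localhost", 9999)
errors = lines.filter(lambda line: "ERROR" in line)
error_counts = errors.map(lambda line: (line.split(" ")[0], 1)).reduceByKey(lambda a, b: a + b)
error_counts.pprint()

ssc.start()
ssc.awaitTermination()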
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerStratio
Spark & Cassandra Use Case at Telefónica CyberSecurity (CBS). Antonio Alcocer (antonio@stratio.com), Oscar Mendez (oscar@stratio.com, @omendezsoto). #CassandraSummit 2014
How Apache Spark fits into the Big Data landscapePaco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
Geospatial Analytics at Scale with Deep Learning and Apache SparkDatabricks
"Deep Learning is now the standard in object detection, but it is not easy to analyze large amounts of images, especially in an interactive fashion. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images.
In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code because of recent developments in Apache Spark. Thanks to Spark’s unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high level information such as houses or buildings. This is possible thanks to Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale thanks to the power of Spark."
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
We provide an update on developments in the intersection of the R and the broader machine learning ecosystems. These collections of packages enable R users to leverage the latest technologies for big data analytics and deep learning in their existing workflows, and also facilitate collaboration within multidisciplinary data science teams. Topics covered include – MLflow: managing the ML lifecycle with improved dependency management and more deployment targets – TensorFlow: TF 2.0 update and probabilistic (deep) machine learning with TensorFlow Probability – Spark: latest improvements and extensions, including text processing at scale with SparkNLP
The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? I explain what is needed for three important use cases.
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodels, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard for processing big data. Spark ships with a Python interface, aka PySpark; however, because Spark’s runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability.
In this talk, we introduce a new type of PySpark UDF designed to solve this problem – the vectorized UDF. Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds – the ability to define easy-to-use, high-performance UDFs and to scale up your analysis with Spark.
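A short example of a vectorized (pandas) UDF of the kind described; the data and column names are made up, and the type-hint syntax shown follows later Spark releases rather than the 2.3-era API.

# A scalar pandas UDF: the function receives pandas Series batches transferred
# via Apache Arrow, instead of one Python object per row.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

@pandas_udf(DoubleType())
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

df = spark.createDataFrame([(0.0,), (25.0,), (100.0,)], ["celsius"])
df.withColumn("fahrenheit", celsius_to_fahrenheit("celsius")).show()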
GT.M: A Tried and Tested Open-Source NoSQL DatabaseRob Tweed
GT.M is a tried and tested schema-less "NoSQL" database with a strong pedigree in the highly demanding banking sector. Its free open-source licensing on x86 GNU Linux makes it an excellent alternative to the list of new, largely untested, NoSQL databases.
InfiniteGraph Presentation from Oct 21, 2010 DBTA WebcastInfiniteGraph
Here is the presentation from Warren Davidson, Director of Business Development, and Darren Wood, InfiniteGraph chief architect. The October 21, 2010 webinar hosted by DBTA, with InfiniteGraph and Riptano, covered new data technologies and how the NOSQL ("Not Only SQL") approach is beneficial in addressing some of the more complex application, scalability and performance requirements in handling vast amounts of data, and in performing advanced analytics on those data volumes with greater ease and speed.
Graphs and Artificial Intelligence have long been a focus for Franz Inc. and currently we are collaborating with Montefiore Health System, Intel, Cloudera, and Cisco to improve a patient’s ability to understand the probabilities of their future health status. By combining artificial intelligence, semantic technologies, big data, graph databases and dynamic visualizations we are deploying a Cognitive Probability Graph concept as a means to help predict future medical events.
The power of Cognitive Probability Graphs stems from the capability to combine the probability space (statistical patient data) with a knowledge base of comprehensive medical codes and a unified terminology system. Cognitive Probability Graphs are remarkable not just because of the possibilities they engender, but also because of their practicality. The confluence of machine learning, semantics, visual querying, graph databases, and big data not only displays links between objects, but also quantifies the probability of their occurrence.
We believe this approach will be transformative for the healthcare field and we see numerous possibilities that exist across business verticals.
During the presentation we will describe the Cognitive Probability Graph concepts using a distributed graph database on top of Hadoop, along with the query language SPARQL to extract feature vectors out of the data, applying R and Spark ML, and then returning the results for further graph processing. #AllegroGraph
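A hedged Python sketch of that pattern, querying a SPARQL endpoint for per-patient features and fitting a Spark ML model; the endpoint URL, predicates, and label are hypothetical and are not Franz Inc.'s actual schema.

# Hypothetical sketch: pull feature vectors from a SPARQL endpoint and feed them
# to Spark ML. Endpoint, query, predicates, and label are illustrative only.
from SPARQLWrapper import SPARQLWrapper, JSON
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

endpoint = SPARQLWrapper("http://example.org/sparql")   # assumed endpoint
endpoint.setQuery("""
    SELECT ?patient ?age ?bmi ?readmitted WHERE {
        ?patient <http://example.org/age> ?age ;
                 <http://example.org/bmi> ?bmi ;
                 <http://example.org/readmitted> ?readmitted .
    }""")
endpoint.setReturnFormat(JSON)
rows = endpoint.query().convert()["results"]["bindings"]

spark = SparkSession.builder.appName("cpg-demo").getOrCreate()
df = spark.createDataFrame(
    [(float(r["age"]["value"]), float(r["bmi"]["value"]), float(r["readmitted"]["value"]))
     for r in rows],
    ["age", "bmi", "label"])

features = VectorAssembler(inputCols=["age", "bmi"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
print(model.coefficients)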
An introduction to the STINGER dynamic graph structure and analysis package. Shows the motivation for STINGER, what has been done with it, and how you can use it. More at http://cc.gatech.edu/stinger
An overview of InfiniteGraph, the distributed graph databaseInfiniteGraph
InfiniteGraph chief architect, Darren Wood, discusses the history, current use cases and future plans for InfiniteGraph, the distributed graph database that helps enterprise and government teams to build applications that find connections and relationships between countless data objects.
In this security solution demo, we have integrated Oracle NoSQL DB with InfiniteGraph to demonstrate the power of using the right tools for the solution. By integrating the key value technology of Oracle with the InfiniteGraph distributed graph database, we are able to create new views of existing Call Detail Record (CDR) details to enable discovery of connections, paths and behaviors that may otherwise be missed.
Discover how to add value to your existing Big Data to increase revenues and performance!
Objectivity/DB: A Multipurpose NoSQL DatabaseInfiniteGraph
The speakers will describe the flexible configuration possibilities that Objectivity/DB provides, with an emphasis on how best to distribute data across multiple storage nodes. The session will start by describing the distributed processing architecture of Objectivity/DB before covering the new Placement Manager features. The speakers will also describe how Objectivity/DB compares and contrasts with other NoSQL solutions.
NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"DataStax Academy
Comcast is developing a highly scalable cloud DVR scheduling system on top of Cassandra. The system is responsible for managing all DVR data and scheduling logic for devices on the X1 platform. This talk will cover the overall architecture of the scheduling system, data model, message queue and notification software that have been developed as part of this ambitious project. We'll take a deep dive into the details of our data model and review the implementation of Comcast's open-source, Cassandra-based clones of Amazon SQS and SNS.
YouTube Link: https://youtu.be/Vs9k3FThNic
*** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification ***
This Edureka PPT on Big Data Technologies will provide you with in-depth knowledge of big data tools. This video will help you understand the different types of big data technologies and the trending big data tools in the IT industry. This PPT covers the following topics:
Agenda
What is Big Data Technology?
Types of Big Data Technologies.
Operational Big Data
Analytical Big Data
Top Big Data Technologies
Big Data Technologies in Storage
Big Data Technologies in Data Mining
Big Data Technologies in Data Analytics
Data Visualization
Emerging Big Data Technologies
Complete Hadoop Playlist: https://goo.gl/ExJdZs
Complete Blog Series: http://bit.ly/2wO8l0y
Big Data Podcast - https://castbox.fm/channel/id1814029
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
.net developer for Jupyter Notebook and Apache Spark and viceversaMarco Parenzan
Jupyter Notebooks and Apache Spark are first-class citizens of the data science space, a true requirement for the "modern" data scientist. But there was a catch: you had to be a Python developer. Now Microsoft is investing in C# as another first-class citizen in this space. Let's look at what .NET can do for notebooks and Spark, and at what notebooks and Spark are.
Design for X: Exploring Product Design with Apache Spark and GraphLabAmanda Casari
Ideas for designing data science products for: bots, knowledge gaps, IOT + fairness. Combining elements from Apache Spark and Turi's GraphLab products.
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Lillian Pierson
In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances that it has spurred. You’ll discover the interesting story of its academic origins and then get an overview of the organizations who are using the technology. After being briefed on some impressive Spark case studies, you’ll learn about the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the tremendous impact that learning Spark can have upon your current salary, and the best ways to get trained in this ground-breaking new technology.
From Developer to Data Scientist - Gaines KergosienITCamp
ABSTRACT: Due to recent advances in technology, humanity is collecting vast amounts of data at an unprecedented rate, making the skills necessary to mine insights from this data increasingly valuable. So what does it take for a Developer to enter the world of data science?
Join me on a journey into the world of big data and machine learning where we will explore what the work actually looks like, identify which skills are most important, and design a road map for how you too can join this exciting and profitable industry.
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Databricks
At Sam's Club we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business, with billions of transactions and trillions of events, it is often essential to use big data technologies. Until recently all of this work ran on several large on-premise Hadoop clusters. As part of our transition to the public cloud we needed to build out an enterprise-scale data platform. Azure Databricks is a key component of this platform, giving our data scientists, engineers, and business users the ability to easily work with the company's data. We will discuss the architecture considerations that led to using multiple Databricks workspaces and external Azure blob storage. We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow. Further, we will discuss the self-service tools that we created to help users get their data to Azure and for us to manage the platform. Finally, we will discuss our security considerations and how they played out in our architecture.
Authors: Andrew Ray, Craig Covey
Large Scale Geospatial Indexing and Analysis on Apache SparkDatabricks
SafeGraph is a data company — just a data company — that aims to be the source of truth for data on physical places. We are focused on creating high-precision geospatial data sets specifically about places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million places across multiple countries and regions.
In this talk, we will inspect the challenges with geospatial processing, running at a large scale. We will look at open-source frameworks like Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structure, data format, and open-source indexing like H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLFlow, and AWS. We will explore examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by Machine Learning modeling, Human-in-the-loop (HITL) annotation, and quality validation.
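A small Apache Sedona sketch of the kind of spatial containment query described; the tables and WKT values are made up, and the registration call shown applies to earlier Sedona releases (newer ones use SedonaContext instead).

# Illustrative Apache Sedona (incubating) spatial query: which visit points fall
# inside which place polygons. Data and names are made up.
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("sedona-demo").getOrCreate()
SedonaRegistrator.registerAll(spark)   # exposes the ST_* SQL functions

spark.createDataFrame(
    [("store_1", "POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))")],
    ["place_id", "wkt"]).createOrReplaceTempView("places")

spark.createDataFrame(
    [("visit_1", 5.0, 5.0), ("visit_2", 20.0, 20.0)],
    ["visit_id", "lon", "lat"]).createOrReplaceTempView("visits")

joined = spark.sql("""
    SELECT p.place_id, v.visit_id
    FROM places p, visits v
    WHERE ST_Contains(ST_GeomFromWKT(p.wkt), ST_Point(v.lon, v.lat))
""")
joined.show()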
Automated Time Series Analysis using Deep Learning, Ray and Analytics ZooJason Dai
Shanghai Apache Spark+AI Online Meetup (https://www.meetup.com/Shanghai-Apache-Spark-AI-Meetup/events/269342169/) on Mar 13, 2020
Topic: Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo (https://github.com/intel-analytics/analytics-zoo)
Speaker: Shan Yu, Intel
This presentation covers architectural principles for software-defined "everything", microservices and their impact on Azure, a geospatial fleet analysis using Spark and HDFS with ESRI, and flow-based programming.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be calculated easily afterwards. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
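A compact Python sketch of one of these ideas: a plain PageRank that skips vertices whose rank has already converged. The graph, damping factor, and tolerance are toy values, and dangling-node mass is ignored for brevity; this is an illustration of the technique, not the STICD implementation.

# Illustrative PageRank that skips already-converged vertices.
def pagerank_skip_converged(out_links, damping=0.85, tol=1e-8, max_iters=100):
    n = len(out_links)
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)

    rank = {v: 1.0 / n for v in out_links}
    converged = {v: False for v in out_links}

    for _ in range(max_iters):
        changed = False
        new_rank = dict(rank)
        for v in out_links:
            if converged[v]:
                continue                      # skip vertices that have settled
            contrib = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new_rank[v] = (1.0 - damping) / n + damping * contrib
            if abs(new_rank[v] - rank[v]) < tol:
                converged[v] = True
            else:
                changed = True
        rank = new_rank
        if not changed:
            break
    return rank

print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]}))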
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Sparksee overview
1. Sparksee Graph Database: General Overview (2014). Dàmaris Coll, Sparsity Technologies — Powering Extreme Data (sparsity-technologies.com)
2. Index
— Introduction
— Sparksee
— Architecture
— Sparksee APIs
— Sparksee mobile
3. Introduction
Sparsity Technologies
— Technology-based company
— Technology comes from research at DAMA-UPC, in Barcelona
— Collaborative work with DAMA-UPC for services and applications
Specialized in
— Managing and querying large graph data
— Efficient and compressible graph management
— Social network analytics: more than 3 years of expertise (ACCESO and Media Planning Group); community search, role discovery, influential people; bibliographical database analytics
4. Sparksee
Sparksee
IS a high-performance and out-of-core graph database management system
FOR large-scale labeled and attributed multigraphs
BASED ON vertical partitioning and collections of object identifiers stored as bitmaps
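Not Sparksee's internal code, but a tiny Python sketch of the general idea on this slide: vertices become small integer object identifiers, and adjacency is kept as bitmaps over those identifiers (plain Python ints stand in for Sparksee's compressed bitmap structures).

# Toy illustration of bitmap-based adjacency (not Sparksee's actual implementation):
# bit i of a vertex's bitmap says whether vertex i is adjacent to it.
class BitmapGraph:
    def __init__(self):
        self.next_oid = 0
        self.adjacency = {}          # oid -> bitmap of neighbor oids

    def add_vertex(self):
        oid = self.next_oid
        self.next_oid += 1
        self.adjacency[oid] = 0
        return oid

    def add_edge(self, u, v):
        self.adjacency[u] |= (1 << v)
        self.adjacency[v] |= (1 << u)

    def neighbors(self, u):
        bits, i = self.adjacency[u], 0
        while bits:
            if bits & 1:
                yield i
            bits >>= 1
            i += 1

g = BitmapGraph()
a, b, c = g.add_vertex(), g.add_vertex(), g.add_vertex()
g.add_edge(a, b)
g.add_edge(a, c)
print(list(g.neighbors(a)))   # [1, 2]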
5. Sparksee — Characteristics
— Graph split into small structures: only the significant parts are moved to main memory (caching)
— Object identifiers (oids) instead of complex objects: reduces memory requirements
— Specific structures to improve traversals: the edges and the neighbors of each node are indexed
— Attribute indices: improve queries based on value filters
— Implemented in C++: different APIs (Java, .NET, etc.) through wrappers
6. Sparksee — Capabilities
— Efficiency: very compact representation using bitmaps; highly compressible data structures
— Capacity: more than 100 billion vertices and edges in a single multicore computer
— Performance: sub-second response in recommendation queries
— Scalability: high throughput for concurrent queries
— Consistency: partial transactional support with recovery
— Multiplatform: Linux, Windows, MacOSX, Mobile
8. Sparksee APIs
Desktop
— Java, .NET (C#), Python and C++ APIs
— Windows, Linux, Mac OS; 32- and 64-bit compilations
— Free download from the website (limited evaluation version)
— Development, Startup and Research programs
— Tailored licenses: graph size, sessions and HA functionality
Mobile
— iOS, Android and BB10
— C++ for iOS, C++ (NDK) and Java (SDK) for Android, C++ for BB10
— Free download from the website (limited evaluation version); requires a prior approved license
— Prices depend on the number of final app downloads
9. API – Hands on: API Class Diagram
Diagram showing the Sparksee, GraphDatabase, Session and Objects (sets of OIDs) classes and the 1:1 and 1:N relationships between them.
10. API – Hands on: API methods
— com.sparsity.sparksee.gdb
— com.sparsity.sparksee.algorithms
— com.sparsity.sparksee.io (not covered by this seminar)
— com.sparsity.sparksee.scripts (not covered by this seminar)
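To tie the class diagram and packages above together, here is a hypothetical usage sketch in Python. The flow (Sparksee to GraphDatabase to Session to Graph) follows the diagram, but the exact method and type names are assumptions and should be checked against the Sparksee reference manual.

# Hypothetical sketch only: follows the Sparksee -> GraphDatabase -> Session flow
# from the class diagram. Method names are assumptions, not the documented API.
import sparksee  # assumed module name of the Python wrapper

cfg = sparksee.SparkseeConfig()
sp = sparksee.Sparksee(cfg)
db = sp.create("HelloSparksee.gdb", "HelloSparksee")   # create a new database file
session = db.new_session()
graph = session.get_graph()

# Create a node type, two nodes, and an edge between them (names are illustrative).
person = graph.new_node_type("PERSON")
knows = graph.new_edge_type("KNOWS", True, True)
alice = graph.new_node(person)
bob = graph.new_node(person)
graph.new_edge(knows, alice, bob)

session.close()
db.close()
sp.close()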