This document discusses design patterns for big data applications. It begins by defining what a design pattern is and providing examples from architecture and software design. It then analyzes characteristics of big data applications to determine appropriate patterns, including volume, velocity, variety, and more. Common patterns are presented, such as percolation, recommendation, and encapsulated processes. Examples include personalized search, medicine, and market segmentation. The document concludes that applying the right patterns can improve the productivity, performance, and maintainability of big data systems.
HORTONWORKS DATA PLATFORM AND IBM SYSTEMS – A COMPLETE SOLUTION FOR COGNITIVE BUSINESS
SynerScope has been helping European organizations across industries unlock competitive business value from data for almost a decade. Now, by leveraging state-of-the-art access control and audit mechanisms from Hortonworks combined with the latest generation high-performance computing and storage solutions from IBM, SynerScope can connect and correlate enterprise data at a scale not previously possible. SynerScope will demonstrate end-to-end analytics workflows including deep-learning based automation using new integrated solutions from Hortonworks and IBM.
Hadoop and Spark are big data frameworks used to extract useful insights across a variety of scenarios, from ingestion, data prep, and data management to processing, analyzing, and visualizing data. Each step requires specialized toolsets to be productive. In this talk I will share solution examples from the big data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight, that simplify your big data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, so you can take advantage of all the benefits of HDInsight and get the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
In 2015/16 Worldpay deployed its Enterprise Data Platform - a highly secure cluster used for analysis of over 65 billion card transactions and the subject of last year's Hadoop Summit keynote in Dublin. A year on, we are now rapidly expanding our platform with true multi-tenancy. For our first tenant we have built and deployed the analytics and reporting for our central platforms. Our second tenant is to deploy 'decision engines' into our core business systems. These allow Worldpay to make decisions derived from machine learning on how we authorise and route payments traffic and how these affect the consumer, merchant and other business partners. We are also developing other tenants for systems management and security. This talk will look at what it means to truly have a single enterprise data lake and multiple tenants that share that data, and look forward to how we will extend the platform in 2017 with Hadoop 3.
Genomics applications like the Genome Analysis Toolkit (GATK) have long used techniques like MapReduce to parallelize I/O, but have never before run on Hadoop. We will describe what we did to build an end-to-end GATK-based genome analysis pipeline on Hadoop, show how it scaled at lower platform cost, and demonstrate the results.
We’re in the midst of an exciting paradigm shift in terms of how we process events data in real time to better react to business opportunities or risk. To stay ahead of your competition, you need the ability to react to business-critical events as they happen. These critical events are created through diverse sources such as social interaction, machine sensors, or a customer transaction. How can you understand the meaning and context of these events that ultimately define your business?
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...MapR Technologies
Big data presents both enormous challenges and incredible opportunities for companies in today’s competitive environment. To deal with the rapid growth of global data, companies have turned to Hadoop to help them with performing real-time search, obtaining fast and efficient analytics, and predicting behaviors and trends. In this session, we’ll demonstrate how we successfully leveraged Hadoop and its ecosystem components to build a converged data infrastructure to meet these needs.
Pouring the Foundation: Data Management in the Energy IndustryDataWorks Summit
At CenterPoint Energy, both structured and unstructured data are continuing to grow at a rapid pace. This growth presents many opportunities to deliver business value and many challenges to control costs. To maximize the value of this data while controlling costs, CenterPoint Energy created a data lake using SAP HANA and Hadoop. During this presentation, CenterPoint will discuss their journey of moving smart meter data to Hadoop, how Hadoop is allowing CenterPoint to derive value from big data and their future use case road map.
Introducing Big Data concepts & Hadoop to those who wish to begin their journey in the future of Information Technology. It is certain that data is going to play a major role in days to come, from our daily lives to the biggest of ventures we might undertake. Hence, knowing about big data and the technologies to work with it is going to be essential for IT professionals.
We will start with some simple presentations and then will go on building upon it. For more intense, focused introduction & training on Big Data and related technologies, visit our website or write to us.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
"Real World Use Cases: Hadoop and NoSQL in Production" by Tugdual Grall.
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples presented during this session.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for big data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseRidwan Fadjar
This slide was created to present the result of my paper about "A Study Review of Common Big Data Architecture for Small-Medium Enterprise" at MSCEIS FPMIPA Universitas Pendidikan Indonesia 2019.
In cooperation with: https://www.linkedin.com/in/faijinali and https://www.linkedin.com/in/fajriabdillah
LendingClub RealTime BigData Platform with Oracle GoldenGateRajit Saha
LendingClub RealTime BigData Platform with Oracle GoldenGate BigData Adapter. This was presented at Oracle Open World 2017 in San Francisco.
Speakers:
Rajit Saha
Vengata Guruswami
NoSQL Application Development with JSON and MapR-DBMapR Technologies
NoSQL databases are being used everywhere by startups and Global 2000 companies alike for data environments that require cost-effective scaling. These environments also typically need to represent data in a more flexible way than is practical with relational databases.
We're introducing MapR Streams, a reliable, global event streaming system that connects data producers and data consumers across shared topics of information. With the integration of MapR Streams, comes the industry’s first and only converged data platform that integrates file, database, event streaming, and analytics to accelerate data-driven applications and address emerging IoT needs.
Are you ready to accelerate your business with the power of a truly global platform for integrating data-in-motion with data-at-rest?
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
Apache Metron (Incubating) is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to enable advanced analytics through machine learning and data science for its users. Because of the relative immaturity of data science platform infrastructure that is integrated into Hadoop and oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss the full-stack data science tooling that has been created to enable data science at scale on an advanced analytics streaming application.
Predicting failure in power networks, detecting fraudulent activities in payment card transactions, and identifying next logical products targeted at the right customer at the right time all require machine learning around massive data sets. This form of artificial intelligence requires complex self-learning algorithms, rapid data iteration for advanced analytics and a robust big data architecture that’s up to the task.
Learn how you can quickly exploit your existing IT infrastructure and scale operations in line with your budget to enjoy advanced data modeling, without having to invest in a large data science team.
How big data and AI saved the day: critical IP almost walked out the doorDataWorks Summit
Cybersecurity threats have evolved beyond what traditional SIEMs and firewalls can detect. We present case studies highlighting how:
•An advanced manufacturer was able to identify new insider threats, enabling them to protect their IP
•A media company’s security operations center was able to verify they weren’t the source of a high-profile media leak.
The common thread across these real-world case studies is how businesses can expand their threat analysis using security analytics powered by artificial intelligence in a big data environment.
Cybersecurity threats increasingly require the aggregation and analysis of multiple data sources. Siloed tools and technologies serve their purpose, but can’t be applied to look across the ever-growing variety and volume of traffic. Big data technologies are a proven solution to aggregating and analysing data across enormous volumes and varieties of data in a scalable way. However, as security professionals well know, more data doesn’t mean more leads or detection. In fact, all too often more data means slower threat hunting and more missed incidents. The solution is to leverage advanced analytical methods like machine learning.
Machine learning is a powerful mathematical approach that can learn patterns in data to identify relevant areas to focus. By applying these methods, we can automatically learn baseline activity and detect deviations across all data sources to flag high-risk entities that behave differently from their peers or past activity. ROY WILDS, Principal Data Scientist, Interset
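To make the baseline-and-deviation idea concrete, here is a minimal sketch in Python. The entities, event counts, and z-score threshold are illustrative assumptions for demonstration, not Interset's actual method.

```python
# Learn per-entity baseline activity, then flag entities whose latest
# behavior deviates strongly from that baseline (z-score rule of thumb).
import numpy as np

history = {                          # daily event counts per entity (toy data)
    "host-a": [10, 12, 11, 9, 10],
    "host-b": [10, 11, 9, 10, 95],   # sudden spike on the last day
}

for entity, counts in history.items():
    baseline = np.mean(counts[:-1])
    spread = np.std(counts[:-1]) + 1e-9   # avoid divide-by-zero
    z = (counts[-1] - baseline) / spread
    if abs(z) > 3:                        # common rule-of-thumb threshold
        print(f"{entity}: deviation flagged (z={z:.1f})")
```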
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...DataWorks Summit
Even after deploying traditional security measures like authentication and authorization to secure sensitive data, data owners and security teams are still struggling to manage and get visibility into risks to their data. The challenge multiplies when data is moving and shared across different data silos such as on-premises Hadoop and public cloud infrastructures such as AWS, Azure and Google Cloud. To control the risks that come with data, enterprises need a comprehensive data-centric approach to easily identify risks, manage security and compliance policies and implement behavior analytics to differentiate between good and bad behavior. This talk will explain a three-step process for implementing data-centric controls in your hybrid environment: discovering where sensitive data is stored, tracking where data is moving, and identifying and controlling potential misuse of the data in near real time.
First-passage percolation on random planar mapsTimothy Budd
Recently, two- and three-point functions have been derived for general planar maps with control over both the number of edges and the number of faces. In the limit of a large number of edges, the multi-point functions reduce to those for random cubic planar maps with random exponential edge lengths, and they can be interpreted in terms of either first-passage percolation (FPP) or an Eden model. We observe a surprisingly simple relation between the asymptotic first-passage time, the hop count (the number of edges in a shortest-time path), and the graph distance (the number of edges in a shortest path). Using (heuristic) transfer matrix arguments, we show that this relation remains valid for random p-valent maps for any p > 2.
Slides for a ten-minute talk on the topic of values. It is about the dark side of the force within all of us, which magically attracts us and prevents real success. It is about recognizing this behavior in oneself, in myself.
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...Shu Tanaka
Our paper entitled “Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on Square Lattice" was published in Journal of the Physical Society of Japan. This work was done in collaboration with Dr. Ryo Tamura (NIMS).
http://journals.jps.jp/doi/abs/10.7566/JPSJ.82.053002
Machine Learning and Logging for Monitoring Microservices Daniel Berman
In this talk I go over the use cases for using machine learning and centralized logging to monitor a distributed, multi-layered microservices architecture.
Scalable and Reliable Logging at PinterestKrishna Gade
At Pinterest, hundreds of services and third-party tools that are implemented in various programming languages generate billions of events every day. To achieve scalable and reliable low-latency logging, there are several challenges: (1) uploading logs that are generated in various formats from tens of thousands of hosts to Kafka in a timely manner; (2) running Kafka reliably on Amazon Web Services, where virtual instances are less reliable than on-premises hardware; (3) moving tens of terabytes of data per day from Kafka to cloud storage reliably and efficiently, while guaranteeing exactly-once persistence per message.
In this talk, we will present Pinterest’s logging pipeline, and share our experience addressing these challenges. We will dive deep into the three components we developed: data uploading from service hosts to Kafka, data transportation from Kafka to S3, and data sanitization. We will also share our experience in operating Kafka at scale in the cloud.
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...Shu Tanaka
Our paper entitled “Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a Stacked Triangular Lattice with Competing Interactions" was published in Physical Review E. This work was done in collaboration with Dr. Ryo Tamura (NIMS).
http://pre.aps.org/abstract/PRE/v88/i5/e052138
BigData & Supply Chain: A "Small" IntroductionIvan Gruer
As part of the LOG2020 master's program in logistics at IUAV, a brief presentation about big data and its impact on supply chains.
Topics and contents have been developed along the research for the MBA final dissertation at MIB School of Management.
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
Watch full webinar here: https://bit.ly/3dMN503
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Watch this session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercise
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonSynerzip
Making AI real-time to meet mission-critical system demands puts a new spin on your architecture. To deliver AI-based applications that will scale as your data grows takes a new approach, one where the data doesn't become the bottleneck. We all know that the deeper the data, the better the results and the lower the risk. However, doing thousands of computations on big data requires new data structures and messaging to be used together to deliver real-time AI. During this session we will look at real reference architectures and review the new techniques that were needed to make AI real-time.
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageSteven Ramage
Some initial considerations and discussion points around geospatial big data. Location adds context and relevance. Need to consider a number of V factors including Value.
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Cloudera, Inc.
PRGX is the world's leading provider of accounts payable audit services and works with leading global retailers. As new forms of data started to flow into their organization, standard RDBMS systems were not allowing them to scale. Now, by using Talend with Cloudera Enterprise, they are able to achieve a 9-10x performance benefit in processing data, reduce errors, and provide more innovative products and services to end customers.
Watch this webinar to learn how PRGX worked with Cloudera and Talend to create a high-performance computing platform for data analytics and discovery that rapidly allows them to process, model, and serve massive amounts of structured and unstructured data.
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Sarah Aerni
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. In Biology with a specialization in Bioinformatics and minor in French Literature from UCSD, and an M.S. and Ph.D in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of the data scientists.
However, most architecture laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative to address these issues and can accelerate data acquisition and massaging. And a customer story on the use of Machine Learning with data virtualization.
Watch full webinar here: https://bit.ly/3mdj9i7
You will often hear that "data is the new gold". In this context, data management is one of the areas that has received the most attention from the software community in recent years. From artificial intelligence and machine learning to new ways to store and process data, the landscape for data management is in constant evolution. From the privileged perspective of an enterprise middleware platform, we at Denodo have the advantage of seeing many of these changes happen.
In this webinar, we will discuss the technology trends that will drive the enterprise data strategies in the years to come. Don't miss it if you want to keep yourself informed about how to convert your data to strategic assets in order to complete the data-driven transformation in your company.
Watch this on-demand webinar as we cover:
- The most interesting trends in data management
- How to build a data fabric architecture
- How to manage your data integration strategy in the new hybrid world
- Our predictions on how those trends will change the data management world
- How companies can monetize data through a data-as-a-service infrastructure
- The role of voice computing in future data analytics
Genome Analysis Pipelines, Big Data Style
Allen Day, Chief Scientist, MapR
Powerful new tools exist for processing large volumes of data quickly across a cluster of networked computers. Typical bioinformatics workflow requirements are well-matched to these tools' capabilities. However, the tool Spark, for example, is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments. Fortunately, these barriers are quickly coming down. In this presentation, we'll examine a few operations common to many bioinformatics pipelines, show how they were usually implemented in the past, and how they're being re-implemented right now to save time, money, and make new types of analysis possible. Some code examples will also be provided.
Video Presentation:
https://youtu.be/iwgfjHiHr7Q
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
Deep learning has enabled dramatic advances in image recognition performance. In this talk I will discuss using a deep convolutional neural network to detect genetic variation in aligned next-generation sequencing human read data. Our method, called DeepVariant, both outperforms existing genotyping tools and generalizes across genome builds and even to other species. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
Spark is a powerful new tool for processing large volumes of data quickly across a cluster of networked computers.
Typical bioinformatics workflow requirements are well-matched to Spark’s capabilities. However, Spark is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments.
These barriers are quickly coming down. ADAM is a software library and set of tools built on top of Spark that make it easy to work with file formats commonly used for genome analysis, like FastQ, BAM, and VCF.
In this presentation, we’ll explore how a step that is common to many bioinformatics workflows, sequence alignment, can be done with Bowtie and ADAM inside a Spark environment to quickly align short reads to a reference genome. A complete code example is demonstrated and provided at https://github.com/allenday/spark-genome-alignment-demo
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
Personalized medicine holds much promise to improve the quality of human life.
However, personalizing medicine depends on genome analysis software that does not scale well. Given the potential impact on society, genomics takes first place among fields of science that can benefit from Hadoop.
A single human genome contains about 3 billion base pairs. This is less than 1 gigabyte of data but the intermediate data produced by a DNA sequencer, required to produce a sequenced human genome, is many hundreds of times larger. Beyond the huge storage requirement, deep genomic analysis across large populations of humans requires enormous computational capacity as well.
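As a quick sanity check on the storage figure, here is a back-of-envelope sketch assuming the common 2-bits-per-base packing (a four-letter alphabet fits in two bits); the encoding choice is an assumption, not stated in the talk.

```python
# Back-of-envelope check of the "less than 1 gigabyte" claim,
# assuming each base is packed into 2 bits.
bases = 3_000_000_000            # ~3 billion base pairs in a human genome
packed_bytes = bases * 2 // 8    # 2 bits per base, 8 bits per byte
print(packed_bytes / 1e9, "GB")  # -> 0.75 GB
```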
Interestingly enough, while genome scientists have adopted the concept of MapReduce for parallelizing I/O, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally.
The science driving genomic analyses is rapidly changing, but the operational problems of processing data from DNA sequencers quickly and reliably are not new.
I present an analysis of the parallels in the fundamental limiting components of the '90s internet boom and the DNA sequencing boom that is currently underway, and illustrate how Hadoop, a proven application architecture used widely in BigData and commercial internet applications can be reused in the genomics sector.
Renaissance in Medicine - Strata - NoSQL and GenomicsAllen Day, PhD
Renaissance in Medicine: Next-Generation Big Data Workloads
Instead of using 1s and 0s (base2), biological software is encoded as A, T, C, and G (base4). DNA sequencers are simply devices for converting information encoded in base4 to base2. Improvements in DNA sequencing technology are happening at a rate that outstrips even Moore’s Law of Computing. As a result, the number of human genomes converted to base2 and uploaded for analysis is rapidly increasing.
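As a toy illustration of that base4-to-base2 conversion, the sketch below maps each nucleotide to two bits; the particular encoding table is an arbitrary choice for demonstration.

```python
# Each nucleotide (A, C, G, T) maps to one of four 2-bit codes.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    bits = 0
    for base in seq:
        bits = (bits << 2) | ENCODE[base]
    return bits

print(bin(pack("GATC")))  # 0b10001101
```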
Medicine is undergoing a renaissance made possible by analyzing and creating insights from this huge and growing number of genomes. Personalized medicine is simply the practical application of these insights.
In this session, I will show how ETL and MapReduce can be applied in a clinical setting. I will also show how NoSQL and advanced analytics can be used to “reverse engineer” the genetic causes of disease. Such information can be used to predict and prevent individual suffering, as well as to increase the overall health of a society.
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
First draft for upcoming Hadoop World presentation "Renaissance in Medicine" that gives an overview of the upcoming changes in medical practice that are enabled by BigData technologies. Specific algorithmic techniques are detailed that enable this use case.
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
Architecting R into the Storm Application Development Process
~~~~~
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
In this presentation, Allen will build a bridge from basic real-time business goals to the technical design of solutions. We will take an example of a real-world use case, compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution.
Q: Can I simply hire one rockstar data scientist to cover all this kind of work?
A: No, interdisciplinary work requires teams
A: Hire leads who can speak the lingo of each required discipline
A: Hire individual contributors who cover 2+ roles, when possible
Statistical Thinking – Solve the Whole Problem
BONUS: Meta Organization – Integration with Adjacent Teams
Co-authors Allen Day @allenday and Paco Nathan @pacoid
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis' slides from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We closed with a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Shapes too big; overwhelming. I would describe the three projects by short name, then add three distinct shapes, making two of them hearts since both are healthcare; start with all line drawings, as color would be too distracting.
Talk track: Both the genotyping and market segmentation solutions have a useful design component known as percolation. The key idea is that there is a fast push to store data and an offline processing step that modifies the data. The modified data could go back to the same data store or….
Speaker: you might note that we show real-time steps in red and non-real-time steps in black.
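As a minimal sketch of the percolation pattern just described: an in-memory Python dict stands in for the persistence layer, and all names are hypothetical; a real deployment would use HBase or M7 tables as discussed later.

```python
# Percolation pattern sketch: a fast write path plus an offline step
# that modifies stored data and writes the result back out.
import time

raw_store = {}      # fast-path persistence layer (stand-in for an HBase table)
derived_store = {}  # percolated results; could also go back to raw_store

def realtime_insert(key, record):
    """Fast push: do the minimum work needed to persist the event."""
    raw_store[key] = {"record": record, "ts": time.time(), "percolated": False}

def percolate():
    """Offline step: scan for new data, derive/modify it, store the result."""
    for key, row in raw_store.items():
        if row["percolated"]:
            continue
        derived_store[key] = row["record"].upper()  # placeholder transform
        row["percolated"] = True

realtime_insert("cust-42", "clicked product page")
percolate()  # in production this runs asynchronously, not inline
print(derived_store["cust-42"])
```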
Talk track: In market segmentation, you want to identify useful segments of your customer base to target for a marketing campaign, for retention, for specific product offerings, etc. What makes "good" segments depends on what you want to do and how the environment changes. You may not know ahead of time what categories make useful segments. One way to find this is to capture customer histories and do a clustering step for discovery and definition of the market segments. This market segment db is then queried and updated in response to new real-time data insertion or new rounds of clustering. Specific feature extraction may also be a useful step from the customer history persistence layer.
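A small sketch of that clustering step, using scikit-learn's KMeans on synthetic customer-history features; the feature set and segment count are illustrative assumptions, not from the deck.

```python
# Discover market segments by clustering customer-history features,
# then assign a new customer to a segment at query time.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row is a customer history reduced to features, e.g.
# [monthly_spend, visits_per_month, days_since_last_order]
histories = rng.normal(loc=[50, 4, 20], scale=[20, 2, 10], size=(500, 3))

# Offline discovery/definition of segments via clustering
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit(histories)

# The "market segment db": centroids answer queries for new real-time data
new_customer = np.array([[120.0, 9.0, 3.0]])
print("segment:", segments.predict(new_customer)[0])
```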
Talk track: the feature extraction step could be triggered by real-time data insertion…
Talk track: a second percolator processes new customer histories relative to the market segments.
Talk track: the clustering step is not triggered by the real-time insertion; it is a scheduled step and thus not an example of percolation. What about the other use case we said was similar, the genotyping?
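One way to picture the distinction between the insertion-triggered percolators and the scheduled clustering step is the sketch below; all names are hypothetical and the function bodies are stubs.

```python
# Contrast the two trigger styles: percolators fire on insertion,
# while re-clustering runs on a timer, independent of insertions.
import sched
import time

EVENTS = {}

def store_event(customer_id, event):
    EVENTS.setdefault(customer_id, []).append(event)

def feature_extraction_percolator(customer_id):
    pass  # recompute features for this customer only

def segment_assignment_percolator(customer_id):
    pass  # re-score this customer against current segment definitions

PERCOLATORS = [feature_extraction_percolator, segment_assignment_percolator]

def on_insert(customer_id, event):
    """Real-time path: persist first, then notify each percolator."""
    store_event(customer_id, event)
    for percolator in PERCOLATORS:
        percolator(customer_id)

def scheduled_reclustering():
    """Not percolation: a full clustering pass, run on a schedule."""
    pass

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(24 * 3600, 1, scheduled_reclustering)  # e.g. once a day
# scheduler.run() would block and fire the job after the delay
```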
Here, we trigger updates to the persona index based on EITHER updates to the persona history, OR updates to the document index. The idea here being that if enough docs have changed or personas are finding "unusual" stuff, the persona is stale and we should recompute it.
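A sketch of that staleness trigger follows; the OR condition mirrors the note above, while the thresholds and scoring are illustrative assumptions.

```python
# Recompute a persona when enough documents have changed OR its
# recent activity looks unusual (i.e., the persona has gone stale).
DOC_CHANGE_THRESHOLD = 1000     # docs changed since the last persona build
UNUSUAL_RATE_THRESHOLD = 0.8    # 0..1, fraction of "unusual" recent finds

def persona_is_stale(persona):
    return (persona["docs_changed_since_build"] > DOC_CHANGE_THRESHOLD
            or persona["unusual_find_rate"] > UNUSUAL_RATE_THRESHOLD)

def maybe_recompute(persona, rebuild_index):
    if persona_is_stale(persona):
        rebuild_index(persona["id"])

persona = {"id": "p7", "docs_changed_since_build": 1500, "unusual_find_rate": 0.2}
maybe_recompute(persona, rebuild_index=lambda pid: print("rebuilding", pid))
```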
Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point, or even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): a) less risk of the delays/IO storms etc. that can happen with HBase, which is VERY important when pushing real-time data to a data store; b) the strategic advantage of using in-memory flags on column families, which is very efficient in M7 where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
Best practice: use one column family per percolator to manage their independent I/O characteristics and prevent I/O storms.
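A sketch of this best practice using happybase, a Python Thrift client for HBase; the host, table, and family names are hypothetical, and the in_memory option here corresponds to the in-memory column-family flag mentioned in the talk track.

```python
# One column family per percolator, so each gets independent I/O settings.
# Requires a running HBase Thrift server at the assumed host.
import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed endpoint
connection.create_table(
    'customer_histories',
    {
        'raw': dict(max_versions=1),       # fast real-time inserts
        'features': dict(in_memory=True),  # feature-extraction percolator
        'segments': dict(in_memory=True),  # segment-assignment percolator
    },
)
```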
Talk track: Now let’s consider the other health data example, genome sequencing for personalized medicine. This is an approach that can be used to get the particular genomic characteristics of a cancerous tumor and compare them to known patient histories in order to select the best option for a customized therapy.
Talk track: While percolation is not used in this example, it does represent a specialized form of recommendation: user-based recommendation. In this genome sequencing / personalized medicine example, a very high bar is set for the accuracy of the recommendation. Here a user-based pattern is best. Let’s look at the generalized form…
Talk track: here is the basic pattern for user-based recommendation, as used in the real use case of personalized medicine. In contrast, in consumer recommendation for shopping, movies, or music, rapid response is key and accuracy is slightly less important. There, item-based recommendation is generally best, because the expensive step of computing co-occurrence can be done offline, prior to a user query.
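To make that contrast concrete, here is a toy sketch of the offline co-occurrence step behind item-based recommendation; the interaction matrix is synthetic, and a production system would compute this as a batch job on the cluster.

```python
# Offline: build an item-item co-occurrence matrix from user-item
# interactions. Online: answer queries with a cheap lookup.
import numpy as np

# Rows = users, columns = items; 1 means the user interacted with the item.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# Offline (expensive) step: how often each pair of items appears together.
cooccurrence = interactions.T @ interactions
np.fill_diagonal(cooccurrence, 0)

# Online step: for a user who just touched item 0, suggest the top
# co-occurring item with a simple lookup.
print("recommend item:", int(np.argmax(cooccurrence[0])))
```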