NetGuardians runs its Big Data Analytics Platform on top of three key Big Data components: Elasticsearch, Apache Mesos and Apache Spark. This is a presentation of the behaviour of this software stack.
First in Class: Optimizing the Data Lake for Tighter Integration - Inside Analysis
The Briefing Room with Dr. Robin Bloor and Teradata RainStor
Live Webcast October 13, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=012bb2c290097165911872b1f241531d
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful data management solutions require a fusion of all relevant data, new and old, which has proven challenging for many companies. With a data lake that’s been optimized for fast queries, solid governance and lifecycle management, users can take data management to a whole new level.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses the relevance of data lakes in today's information landscape. He'll be briefed by Mark Cusack of Teradata, who will explain how his company's archiving solution has developed into a storage point for raw data. He'll show how the proven compression, scalability and governance of Teradata RainStor, combined with Hadoop, can enable an optimized data lake that serves both as a reservoir for historical data and as a "system of record" for the enterprise.
Visit InsideAnalysis.com for more information.
Powering Realtime Decision Engines in Finance and Healthcare using Open Sour... - Greg Makowski
http://www.kdd.org/kdd2015/industry-gov-talks.html
Financial services and healthcare companies could be the biggest beneficiaries of big data. Their realtime decision engines can be vastly improved by leveraging the latest advances in big data analytics. However, these companies are challenged in leveraging Open Software Systems (OSS). This presentation covers how, in collaboration with financial services and healthcare institutions, we built an OSS project to deliver a realtime decisioning engine for their respective applications. I will address two key issues. First, I will describe the strategy behind our hiring process to attract millennial big data developers and the results of this endeavor. Second, I will recount the collaboration effort that we had with our large clients and the various milestones we achieved during that process. I will explain the goals regarding big data analysis that our large clients presented to us and how we accomplished those goals. In particular, I will discuss how we leveraged open source to deliver a realtime decisioning software product called Kamanja to these institutions. An advantage of developing applications in Kamanja is that it is already integrated with Hadoop, with Kafka for realtime data streaming, and with HBase and Cassandra for NoSQL data storage. I will talk about how these companies benefited from Kamanja and some of the challenges we faced in the design of this software. I will provide quantifiable improvements in key metrics driven by Kamanja and interesting, unsolved problems/challenges that need to be addressed for faster and wider adoption of OSS by these companies.
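Kamanja's own engine is JVM-based, so the following is not Kamanja code; it is a minimal, hypothetical Python sketch of the realtime decisioning pattern the abstract describes (consume events from Kafka, score them, emit decisions), using the kafka-python client. The topic names and the scoring rule are invented for illustration.

```python
# Minimal sketch of a realtime decisioning loop, assuming a local Kafka
# broker and JSON-encoded transaction events. The topic names and the
# scoring rule are hypothetical; Kamanja's actual engine is JVM-based.
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "transactions",                              # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def score(event):
    # Stand-in model: flag unusually large amounts. A real decisioning
    # engine would evaluate trained models over enriched features.
    return 1.0 if event.get("amount", 0) > 10_000 else 0.0

for message in consumer:
    event = message.value
    producer.send("decisions", {"id": event.get("id"), "risk": score(event)})
```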
Agile, Automated, Aware: How to Model for Success - Inside Analysis
The Briefing Room with David Loshin and Embarcadero
Live Webcast October 27, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=eea9877b71c653c499c809c5693eae8fe
Data management teams face some tough challenges these days. Organizations need business-driven visibility that enables understanding and awareness of enterprise data assets – without worrying about definitions and change management. But with information architectures evolving into a hybrid mix of data objects and data services built over relational databases as well as big data stores, serving up accurately defined, reusable data can become a complex issue.
Register for this episode of The Briefing Room to learn from veteran Analyst David Loshin as he explains the importance of agile, automated workflows in today’s enterprise. He’ll be briefed by Ron Huizenga of Embarcadero, who will discuss how his company’s ER/Studio suite approaches data modeling and management from a modern architecture standpoint. He will explain that unifying the way information is represented can not only eliminate the need for costly workarounds, but also foster collaboration between data architects, developers and business users.
Visit InsideAnalysis.com for more information.
How to Create 80% of a Big Data Pilot Project - Greg Makowski
When evaluating Open Source Software, or other software of a certain size or complexity, organizations frequently want to conduct a Pilot project, or Proof of Concept (POC). This talk describes a process to shorten the Pilot by reusing configurations from performance testing as the POC's starting configurations.
Kamanja: Driving Business Value through Real-Time Decisioning Solutions - Greg Makowski
This is a first presentation of Kamanja, a new open-source real-time software product that integrates with other big-data systems. See also http://www.meetup.com/SF-Bay-ACM/events/223615901/ and http://Kamanja.org for downloads, docs and community support. For the YouTube video, see https://www.youtube.com/watch?v=g9d87rvcSNk (you may want to start at minute 33).
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo... - Denodo
Watch full webinar here: https://bit.ly/3mfFJqb
Presented at Chief Data Officer Live Series 2021, ASEAN (August Edition)
While big data initiatives have become necessary for any business to generate actionable insights, a big data fabric has become a necessity for any successful big data initiative. A best-of-breed big data fabric should deliver actionable insights to business users with minimal effort, provide end-to-end security for the entire enterprise data platform, and provide real-time data integration while delivering a self-service data platform to business users.
Watch this on-demand session to learn how big data fabric enabled by Data Virtualization:
- Provides lightning-fast self-service data access to business users
- Centralizes data security, governance, and data privacy
- Fulfills the promise of data lakes to provide actionable insights
Understanding Big Data Analytics - solutions for growing businesses - Rafał M... - GetInData
Did you like it? Check out our blog to stay up to date: https://getindata.com/blog
Data Analytics has become a central point in many Digital Transformation programs. Building a data-driven organisation requires a common understanding of the foundations of data analytics at every level. This presentation will help you and your colleagues understand Big Data, Data Science, Machine Learning and Artificial Intelligence.
Watch our webinar about Big Data Analytics: https://youtu.be/jdfKHVWov6A
Speaker: Rafał Małanij
---
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies, including Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization - Denodo
Watch here: https://bit.ly/2NGQD7R
In an era increasingly dominated by advancements in cloud computing, AI and advanced analytics, it may come as a shock that many organizations still rely on data architectures built before the turn of the century. But that scenario is rapidly changing with the increasing adoption of real-time data virtualization - a paradigm shift in the approach that organizations take towards accessing, integrating, and provisioning the data required to meet business goals.
As data analytics and data-driven intelligence take centre stage in today's digital economy, logical data integration across the widest variety of data sources, with a proper security and governance structure in place, has become mission-critical.
Attend this session to learn:
- How you can meet cloud and data science challenges with data virtualization
- Why data virtualization is increasingly finding enterprise-wide adoption
- How customers are reducing costs and improving ROI with data virtualization
Joerg Bienert, CTO of ParStream, held a presentation on February 25, 2014 about Big Data for Business Users. He talked about several use cases of current ParStream customers and about ParStream's technology itself.
A Successful Data Strategy for Insurers in Volatile Times (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3rpr4La
Data is an insurer’s most valuable asset. Capitalizing on all of that stored and incoming data to draw valuable insights for business decisions is what ultimately makes a competitive difference.
But insurers face challenges when it comes to modernizing and digitizing their data architectures. Most organizations rely on traditional systems and data integration processes that are time-consuming and slow. In addition, as many adopt cloud strategies, these traditional approaches burden the cloud modernization process with downtime and end-user frustration.
This is why insurers need a flexible and easily adaptable data integration technology that allows them to keep up with the ever-changing and growing data environment.
Data virtualization is that modern data integration technology. It can support insurers not only on their journey to digitization, but also on their future infrastructure changes and innovations, adding agility, flexibility and efficiency to data architectures. Data virtualization can help insurance companies create 360° views of deals and claims processes, as well as quickly gather social media or sensor data for on-the-go risk profiling.
Join this on-demand webinar to:
- Find out why data virtualization should be a part of your enterprise data strategy
- See how this technology can help you capitalize on your data
- Hear how many of your peers are already leveraging the Denodo Platform for Data Virtualization and the benefits they’re observing
Transforming GE Healthcare with Data Platform Strategy - Databricks
Data and analytics are foundational to the success of GE Healthcare's digital transformation and market competitiveness. This use case focuses on a heavy platform transformation that GE Healthcare drove over the last year to move from an on-prem legacy data platform strategy to a cloud-native and completely services-oriented strategy. This was a huge effort for an 18Bn company, executed in the middle of the pandemic. It enables GE Healthcare to leapfrog in its enterprise data analytics strategy.
There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.
But where are the data science and data engineering patterns?
Sometimes, data engineering reminds me of cowboy coding - many workarounds, immature technologies and a lack of market best practices.
Watch full webinar here: https://bit.ly/3mdj9i7
You will often hear that "data is the new gold". In this context, data management is one of the areas that has received the most attention from the software community in recent years. From Artificial Intelligence and Machine Learning to new ways to store and process data, the landscape for data management is in constant evolution. From the privileged perspective of an enterprise middleware platform, we at Denodo have the advantage of seeing many of these changes happen.
In this webinar, we will discuss the technology trends that will drive enterprise data strategies in the years to come. Don't miss it if you want to keep yourself informed about how to convert your data into strategic assets in order to complete the data-driven transformation in your company.
Watch this on-demand webinar as we cover:
- The most interesting trends in data management
- How to build a data fabric architecture
- How to manage your data integration strategy in the new hybrid world
- Our predictions on how those trends will change the data management world
- How companies can monetize data through a data-as-a-service infrastructure
- The role of voice computing in future data analytics
Watch here: https://bit.ly/3i2iJbu
You will often hear that "data is the new gold". In this context, data management is one of the areas that has received the most attention from the software community in recent years. From Artificial Intelligence and Machine Learning to new ways to store and process data, the landscape for data management is in constant evolution. From the privileged perspective of an enterprise middleware platform, we at Denodo have the advantage of seeing many of these changes happen.
Join us for an exciting session that will cover:
- The most interesting trends in data management
- Our predictions on how those trends will change the data management world
- How these trends are shaping the future of data virtualization and our own software
Stephen Cantrell, kdb+ Developer at Kx Systems, "Kdb+: How Wall Street Tech c..." - Dataconomy Media
Stephen Cantrell, kdb+ Developer at Kx Systems
"Kdb+: How Wall Street Tech can Speed up the World"
You can see some additional notes here:
https://github.com/cantrells/berlin_kdb_demo?files=1
Strategizing Big Data in Telco
Big data is a very hot topic nowadays. Some industries depend on it completely, some have opportunities to roll out their strategies and execute, and some are just considering when the right time is to hop in.
To my mind, Big Data is not about technology. Big data is about people generating data, and data being used for the benefit of people.
Big data is a pool of activities aimed at processing the data a company owns (internal and external) so as to open new revenue opportunities, minimize costs and enhance UX.
I had some ideas and thoughts on where telecommunication companies might start in formulating a Big Data strategy, and packed some of the most important pieces into a small presentation.
What is the difference between Small Data and Big Data?
What kind of data is used currently, and which is to be relied on in the new paradigm?
What kind of products are expected from telcos?
My personal ranking of operators in terms of their Big Data execution
What are the stages telcos should pass through to become a Big Data operator?
Prerequisites for Big Data transformation
Please take a look at the presentation to find answers to these questions and feel free to share your opinion.
Thanks!
Accelerate Self-Service Analytics with Data Virtualization and Visualization - Denodo
Watch full webinar here: https://bit.ly/3fpitC3
Enterprise organizations are shifting to self-service analytics, as business users need real-time access to holistic and consistent views of data, regardless of its location, source or type, in order to arrive at critical decisions.
Data Virtualization and Data Visualization work together through a universal semantic layer. Learn how they enable self-service data discovery and improve the performance of your reports and dashboards.
In this session, you will learn:
- Challenges faced by business users
- How data virtualization enables self-service analytics
- Use case and lessons from customer success
- Overview of the highlighted features in Tableau
San Antonio’s electric utility making big data analytics the business of the ... - DataWorks Summit
Being part of a municipality-owned electric utility offers a unique opportunity to lead in the area of big data analytics. What moves the electric utility of the 7th-largest city in the U.S.? The answer is people. For years, CPS Energy has invested in the development of local talent, local technology development, city growth, its employees, and an asset infrastructure that is setting the stage for continued success. At CPS Energy, when such investments are topped by a data infrastructure and applications conducive to the creation of business insights, we can justify and prioritize investments. For us, the biggest people opportunities in big data analytics are around operations, customer and employee engagement, and safety. The presenter will provide examples and share how his views have evolved from those of a researcher, to global renewable energy consultant, to technology innovator, and more recently to a "harvester of value" from within people, process, and technology assets. Lastly, current and anticipated future states with regard to San Antonio's electric utility big data enablement platform will be presented...
Speaker
Rolando Vega, Manager of Analytics and Business Insight, CPS Energy
Moving Targets: Harnessing Real-time Value from Data in Motion - Inside Analysis
The Briefing Room with David Loshin and Datawatch
Live Webcast Feb. 17, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=4a053043c45cf0c2f6453dfb8577c72a
Patience may be a virtue, but when it comes to streaming analytics, waiting is not an option. Between Big Data and the Internet of Things, businesses are faced with more data and greater complexity than ever before. Traditional information architectures simply cannot support the kind of processing necessary to make use of this fast-moving resource. The modern context requires a shorter path to analytics, one that narrows the gap between governance and discovery.
Register for this episode of The Briefing Room to hear veteran Analyst David Loshin as he explains how the prevalence of streaming data is changing business pace and processes. He’ll be briefed by Dan Potter of Datawatch, who will tout his company’s real-time data discovery platform for data in motion. He will show how self-service data preparation can lead to faster insights, ultimately fostering the ability to make precise decisions at the right time.
Visit InsideAnalysis.com for more information.
When you look at traditional ERP or management systems, they are usually used to manage the supply chain originating from either the point of origin or the point of destination, which are all primarily physical locations. And for these, you have several processes like order-to-cash, source-to-pay, physical distribution, production, etc.
The DoneDeal AWS Data Analytics Platform was built using AWS products: EMR, Data Pipeline, S3, Kinesis, Redshift and Tableau. The custom-built ETL was written using PySpark.
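As a rough illustration of what such a PySpark ETL step can look like (a sketch only; the bucket names, fields and logic below are hypothetical, not DoneDeal's actual pipeline):

```python
# Hypothetical sketch of a PySpark ETL step in the style described above:
# read raw JSON events from S3, aggregate daily ad views, and write the
# result back to S3 as Parquet (typically loaded into Redshift via COPY).
# The bucket names and fields are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-ad-views").getOrCreate()

events = spark.read.json("s3a://example-raw-bucket/events/2024-01-01/")

daily_views = (
    events
    .where(F.col("event_type") == "ad_view")
    .groupBy("ad_id", F.to_date("timestamp").alias("day"))
    .count()
)

daily_views.write.mode("overwrite").parquet(
    "s3a://example-curated-bucket/daily_ad_views/"
)
```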
From Pipelines to Refineries: scaling big data applications with Tim Hunter - Databricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need for rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole-program checks, auto-caching, and aggressive computation parallelization and reuse.
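The laziness this talk builds on is already visible in vanilla Spark: transformations only assemble a plan, and nothing executes until an action forces it. A minimal PySpark illustration:

```python
# Spark transformations are lazy: filter() and withColumn() only build a
# logical plan. Nothing runs until an action (count, collect, write) is
# invoked, which lets Spark optimize the whole pipeline at once.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("laziness-demo").getOrCreate()

df = spark.range(1_000_000)                       # no job yet
evens = df.filter(F.col("id") % 2 == 0)           # still no job
squared = evens.withColumn("sq", F.col("id") ** 2)  # still no job

squared.cache()          # marks the result for reuse across actions
print(squared.count())   # first action: the whole plan executes here
print(squared.count())   # second action: served from the cache
```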
Deep Learning on Apache® Spark™: Workflows and Best Practices - Databricks
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
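A common implementation pattern behind the data-ingest and worker-configuration points above (a sketch under assumptions, not the webinar's actual code; the model path and feature layout are hypothetical) is to load the model once per partition so each Spark worker scores its rows locally:

```python
# Sketch of a common Spark + deep learning inference pattern: load a trained
# Keras model once per partition and score rows locally on each worker.
# The model path and feature layout are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dl-inference").getOrCreate()
sc = spark.sparkContext

features = sc.parallelize([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

def score_partition(rows):
    # Import and load inside the function so this happens on the worker,
    # once per partition, instead of shipping the model from the driver.
    import numpy as np
    import tensorflow as tf

    model = tf.keras.models.load_model("/models/example_net.keras")
    batch = np.array(list(rows), dtype="float32")
    if batch.size == 0:
        return iter([])
    return iter(model.predict(batch).tolist())

predictions = features.mapPartitions(score_partition).collect()
print(predictions)
```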
In this one-day workshop, we will introduce Spark in a high-level context. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
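To give a flavour of the SQL-like aggregations this workshop mentions, a minimal PySpark sketch with invented data:

```python
# Minimal taste of a PySpark SQL-like aggregation of the kind such a
# workshop covers; the data is invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("workshop-demo").getOrCreate()

sales = spark.createDataFrame(
    [("books", 12.0), ("books", 8.5), ("music", 20.0)],
    ["category", "amount"],
)

totals = sales.groupBy("category").agg(
    F.sum("amount").alias("total"),
    F.count(F.lit(1)).alias("n_orders"),
)
totals.show()
```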
Big Data Processing with Apache Spark 2014 - mahchiev
Apache Spark™ is a fast and general engine for large-scale data processing. It has gained enormous popularity recently with its speed and ease of use, and is currently replacing traditional Hadoop MapReduce. We'll talk about:
1. What is Big Data?
2. The Map-Reduce paradigm (illustrated in the sketch after this list)
3. What does Apache Spark do?
4. Finally, we'll make a quick demo
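To make item 2 concrete, here is the canonical Map-Reduce example, word count, expressed with Spark's RDD API (a sketch; classic Hadoop MapReduce splits the same logic into mapper and reducer classes):

```python
# The canonical Map-Reduce example, word count, in Spark's RDD API:
# flatMap/map play the "map" role, reduceByKey plays the "reduce" role.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not to be", "big data is big"])
counts = (
    lines.flatMap(lambda line: line.split())   # map: line -> words
         .map(lambda word: (word, 1))          # map: word -> (word, 1)
         .reduceByKey(lambda a, b: a + b)      # reduce: sum counts per word
)
print(counts.collect())
```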
Introduction to Modern Software Architecture - Jérôme Kehrli
This talk offers an introduction to software architecture with a modern perspective. We will consider a new way to identify architectural elements and walk through some examples of modern architectures, the NoSQL world, Big Data architectures and micro-services.
A proposed framework for Agile Roadmap Design and Maintenance - Jérôme Kehrli
Maintaining a relevant and meaningful roadmap while adopting a state-of-the-art Agile methodology is challenging, and the two goals are somewhat contradictory.
This presentation proposes a framework for designing and maintaining an Agile Roadmap.
A presentation of the search for Product-Market Fit, with the principles, practices and processes that lead to it, from the Lean Startup and Design Thinking perspective.
From Product Vision to Story Map - Lean / Agile Product shaping - Jérôme Kehrli
A lot of Software Engineering projects fail for lack of a shared vision, due to poor communication among the people involved in the project.
Sound maintenance of the product backlog can only be achieved if all these people have a good understanding of what they have to do (a common vision).
Roman Pichler, in a post originally written on Jul 16, 2012, proposed a really interesting approach: use various canvases to create and share the product vision and to support product backlog creation and refinement.
This presentation is a drive through the various boards and canvases that should be designed prior to any product development: the Product Vision, the Lean Canvas, the Product Definition and the Story Map.
Artificial Intelligence and Digital Banking - What about fraud prevention? - Jérôme Kehrli
Artificial intelligence for banking fraud prevention.
A presentation on how it takes root in digitalisation trends and how it impacts the customer experience.
Artificial Intelligence for Banking Fraud Prevention - Jérôme Kehrli
Artificial Intelligence at NetGuardians:
"From skepticism to large scale adoption towards fraud prevention"
Slides of my speech at the EPFL / EMBA Innovation Leader 2018 event.
Periodic Table of Agile Principles and Practices - Jérôme Kehrli
Recently I stumbled by chance upon the Periodic Table of the Elements... Long time no see... Remembering my physics lessons at university, I have always loved that table. I remember spending hours understanding the layout and admiring the beauty of its natural simplicity.
So I had the idea of trying the same layout - not the same approach, since the two are not comparable; really just the same layout - for Agile Principles and Practices.
The result is in this presentation: The Periodic Table of Agile Principles and Practices.
Agility and planning: tools and processes - Jérôme Kehrli
In this presentation, I intend to present the fundamentals, the roles, the processes, the rituals and the values that I believe a team would need to embrace to achieve success down the line in Agile Software Development Management - Product Management, Team Management and Project Management - with the ultimate goal of making planning and forecasting as simple and efficient as it can be.
Bytecode manipulation with Javassist for fun and profit - Jérôme Kehrli
Java bytecode is the form of instructions that the JVM executes.
Normally, a Java programmer does not need to be aware of how Java bytecode works.
Understanding the bytecode, however, is essential in the areas of tooling and program analysis, where applications can modify the bytecode to adjust behavior according to the application's domain. Profilers, mocking tools, AOP, ORM frameworks, IoC containers, boilerplate code generators, etc. require a thorough understanding of Java bytecode, along with means of manipulating it at runtime.
Each and every one of these advanced features of what are nowadays standard approaches to programming in Java requires a sound understanding of Java bytecode, not to mention the completely new languages running on the JVM, such as Scala or Clojure.
Bytecode manipulation is not easy though ... except with Javassist.
Of all the libraries and tools providing advanced bytecode manipulation features, Javassist is the easiest to use and the quickest to master. It takes any initiated Java developer a few minutes to understand Javassist and be able to use it efficiently. And mastering bytecode manipulation opens up a whole new world of approaches and possibilities.
DevOps is a methodology capturing the practices adopted from the very start by the web giants, who had a unique opportunity as well as a strong requirement to invent new ways of working, due to the very nature of their business: the need to evolve their systems at an unprecedented pace, and to extend them and their business sometimes on a daily basis.
While DevOps obviously makes critical sense for startups, I believe that big corporations with large and old-fashioned IT departments are actually the ones that can benefit the most from adopting these principles and practices.
Digitalization: A Challenge and An Opportunity for Banks - Jérôme Kehrli
Today's banking industry is strongly defined by one word: digital. The urgency to act only grows more severe each day. Banks using digital technologies to automate processes, improve regulatory compliance, and transform the customer experience may realize a profit upside of 40% or more, while laggards that resist digital innovation will be punished by customers, financial markets and regulators, and may see up to 35% of net profit eroded, according to a McKinsey analysis.
The vital question to answer is: how do we get digitalization right? And why is it becoming so urgent to digitize?
Some years ago, Eric Ries, Steve Blank and others initiated The Lean Startup movement. The Lean Startup is a movement, an inspiration, a set of principles and practices that any entrepreneur initiating a startup would be well advised to follow.
Projecting myself into it, I think that if I had read Ries' book earlier, or even better Blank's book, I would maybe own my own company today, built around AirXCell or another product, instead of being disgusted and honestly not considering it for the near future.
In addition to giving a pretty important set of principles when it comes to creating and running a startup, The Lean Startup also implies an extended set of Engineering practices, especially software engineering practices.
Smart Contracts are a central component of next-generation blockchain platforms. Blockchain technology is much broader than just bitcoin. The sustained levels of robust security achieved by public cryptocurrencies have demonstrated to the world that this new wave of blockchain technologies can provide efficiencies and intangible technological benefits very similar to what the internet has done.
Blockchains are a very powerful technology, capable of going much further than "simple" financial transactions; a technology capable of performing complex operations, and of understanding much more than just how many bitcoins one currently has in one's digital wallet.
This is where the idea of Smart Contracts comes in. Smart Contracts are in the process of becoming a cornerstone of enterprise blockchain applications and will likely become one of the pillars of blockchain technology.
In this presentation, we will explore what a smart contract is, how it works, and how it is being used.
The Blockchain - The Technology behind Bitcoin - Jérôme Kehrli
The blockchain and blockchain-related topics are increasingly discussed and studied nowadays. There is not one single day when I don't hear about it, be it on LinkedIn or elsewhere.
I recently took a deep interest in the blockchain topic, and this is the first article of a whole upcoming series around the blockchain.
This presentation is an introduction to the blockchain: it presents what the blockchain is in light of its initial deployment in the Bitcoin project, as well as the technical details and architecture concerns behind it.
We won't focus here on business applications aside from what is required to present the blockchain's purpose; more concrete business applications and evolutions will be the topic of another presentation I'll post in a few weeks.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... - SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also held a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
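DIAR's actual byte-level analysis is more sophisticated than this, but the underlying idea of discarding seed bytes that do not affect the behaviour of interest can be sketched as a greedy reduction loop; the "interestingness" predicate below is a stand-in for a real run-the-target-and-compare-coverage check:

```python
# Toy sketch of trimming uninteresting seed bytes: greedily drop chunks of
# the seed as long as an "interestingness" check still holds. DIAR's real
# analysis is more sophisticated; this only illustrates the reduction loop.
def still_interesting(seed: bytes) -> bool:
    # Stand-in predicate: a real setup would run the target under the
    # fuzzer's instrumentation and compare coverage. Here we just require
    # that the XML-ish markers survive.
    return seed.startswith(b"<") and seed.endswith(b">")

def trim_seed(seed: bytes, chunk: int = 4) -> bytes:
    i = 0
    while i < len(seed):
        candidate = seed[:i] + seed[i + chunk:]   # try dropping one chunk
        if still_interesting(candidate):
            seed = candidate                      # keep the smaller seed
        else:
            i += chunk                            # chunk was needed; skip it
    return seed

print(trim_seed(b"<a>padding padding padding</a>"))
```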
GridMate - End to end testing is a critical piece to ensure quality and avoid... - ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of these features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
3. About NetGuardians
• Top Fintech Europe company
• Behavioural analysis based on risk models combining human actions relative to channels, technical layers and transactions
• Stay on top of new regulatory needs and anti-fraud patterns using profiling and analytics
• Our intelligence updates automatically deliver new controls
[Diagram: E-BANKING — Channels, IT layers, Transactions]
4. The Problem
• Fraud costs the world $3 trillion per year, and 70% of it is internal (Certified Fraud Examiners, Report to the Nations, 2014)
• Projected cyber crime cost by 2021: $6 trillion (Cybersecurity Ventures, 2016)
• It takes 18 months on average to detect fraud, and most remains undetected (Certified Fraud Examiners, Report to the Nations, 2014)
• $2.5 billion: the fine one single bank was slapped with due to inadequate internal controls and a slow documentation process (Bloomberg, April 2015)
5. All the caps you need
One single platform
Unique solution made for banks
10. Mesos is a distributed systems kernel. It runs on every machine and provides applications (…) with APIs for resource management and scheduling across entire datacenter and cloud environments.
Apache Spark is a fast and general engine for large-scale data processing. It provides programmers with an API functioning as a working set for distributed programs, offering a versatile form of distributed shared memory.
ElasticSearch is a distributed, real-time, RESTful search and analytics document-oriented storage engine. It lets one perform and combine many types of searches - structured, unstructured, geo, metric - in real time.
Apache Mesos (v1.0 = July 2016, v1.3 = July 2017)
Apache Spark (v1.0 = May 2014, v2.2 = July 2017)
ElasticSearch (v1.0 = February 2014, v6.0 beta = July 2017)
11. ES-Hadoop: connects the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch.
Interestingly, Spark can perfectly well use ES-Hadoop to load data from or store data to ElasticSearch outside of a Hadoop stack: the Spark connector from the ES-Hadoop library has no dependency on a Hadoop stack whatsoever.
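Since the connector is Hadoop-free, a plain PySpark session is enough. A minimal sketch (not NetGuardians' actual code), assuming the elasticsearch-spark jar is on the classpath; the host and the index names are placeholders:

```python
# Reading an ES index into a Spark DataFrame through the ES-Hadoop /
# ES-Spark connector, with no Hadoop cluster anywhere.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-spark-demo").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")  # DataFrame source from ES-Hadoop
      .option("es.nodes", "localhost")        # placeholder host
      .option("es.port", "9200")
      .load("transactions"))                  # placeholder index name

df.printSchema()

# Writing back goes through the same connector
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")
   .mode("append")
   .save("transactions-scored"))              # placeholder output index
```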
17. Analytics approach
Pattern Based Intelligence
• Fundamentally rule based
• Implemented as pyspark scripts
• Custom approach (no framework)
Profiling
• Statistical model
• Natively implemented using both ES and Spark statistics functions
• Custom approach (no framework)
Machine Learning
• Advanced algorithms
• Prototyped using Python scikit-learn
• Industrialized using Spark MLlib
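To make the "prototype, then industrialize" path concrete, here is a hedged sketch: the same logistic-regression model first in scikit-learn, then in Spark MLlib. The features, labels and data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two invented features per transaction, toy fraud labels
X = np.array([[1.0, 20.0], [2.0, 300.0], [1.5, 25.0], [3.0, 900.0]])
y = np.array([0, 1, 0, 1])
prototype = LogisticRegression().fit(X, y)          # scikit-learn prototype

# The "industrialized" Spark MLlib equivalent
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression as SparkLogReg
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
train = spark.createDataFrame(
    [(Vectors.dense(r), float(l)) for r, l in zip(X.tolist(), y.tolist())],
    ["features", "label"])
model = SparkLogReg().fit(train)                    # distributed training
```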
20. ES-Hadoop and Data Locality
Data-locality enforcement works well:
• ES-Hadoop makes Spark understand the topology of the shards on ES
• Mesos / Spark respects locality requirements and creates as many partitions as shards
But it works only under nominal conditions. Several factors compromise data locality:
• Spark waits only spark.locality.wait=10s for the processing to be executed on the Spark node co-located with an ES shard
• If ES on the co-located node is busy, ES can decide to answer from another node
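One knob against the first factor is the spark.locality.wait setting itself: raising it gives the scheduler more time to place a task on the node holding the ES shard before it falls back to any other node. A sketch; "30s" is an arbitrary example value, to be tuned per workload:

```python
from pyspark.sql import SparkSession

# Give the scheduler more time to achieve node-local placement
spark = (SparkSession.builder
         .appName("locality-demo")
         .config("spark.locality.wait", "30s")
         .getOrCreate())
```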
21. Mesos / Spark Scheduling Mode
In coarse-grained scheduling mode, Mesos only knows about Spark executor processes:
• Mesos books as many cluster resources as possible to allocate Spark executors for a job.
Historically, Spark on Mesos could also use fine-grained scheduling mode, where Mesos schedules each and every individual Spark task:
• Kills performance!
• Deprecated: https://issues.apache.org/jira/browse/SPARK-11857
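The mode is driven by one configuration property. A sketch, with a placeholder Mesos master URL; spark.mesos.coarse defaults to true on recent versions, and setting it to false selects the deprecated fine-grained mode:

```python
from pyspark import SparkConf

# Pin coarse-grained mode explicitly when running on Mesos;
# the conf is then passed to SparkContext / SparkSession.
conf = (SparkConf()
        .setAppName("coarse-grained-demo")
        .setMaster("mesos://zk://mesos-master:2181/mesos")  # placeholder URL
        .set("spark.mesos.coarse", "true"))
```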
22. Spark Static Resource Allocation vs. Dynamic Allocation (1/2)
Static Resource Allocation
• Mesos / Spark decides the allocated resources at job init time
• Allocated resources are kept until the job completes
• Two noteworthy consequences (see the sketch after this list):
1. By default, every single job running alone gets the whole cluster; a following job would need to wait.
2. Several jobs arriving together get the cluster fairly shared; if only one of them is long-lived, that job still has to complete its execution on its small portion.
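The usual mitigation under static allocation is to cap each job's share so that a single job cannot book the whole cluster. A sketch; "8" is an arbitrary example value:

```python
from pyspark import SparkConf

# Cap this job's statically allocated cores so other jobs can start
conf = SparkConf().set("spark.cores.max", "8")
```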
23. Spark Static Resource Allocation vs. Dynamic Allocation (2/2)
Dynamic Allocation
• Designed as a solution to the previous problems, and it works out of the box
• But ... Spark's Dynamic Allocation completely breaks the data-locality optimization:
• ES-Hadoop makes Spark request as many executors as there are shards and indicates the nodes owning the ES shards as preferred locations
• Dynamic allocation bypasses this completely, defeating the data-locality optimization (configuration sketch below)
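For reference, this is what turning it on looks like; note that dynamic allocation requires the external shuffle service, and, as the slide warns, executors may then land on nodes that do not hold the ES shards:

```python
from pyspark import SparkConf

# Enable dynamic allocation (locality-sensitive jobs may prefer static)
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true"))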
24. Other concerns
• Python latency
  • Java and Scala jobs run natively in the Spark JVM, whereas pyspark launches some tasks in a separate process from the Spark JVM.
  • DataFrame and RDD methods exposed to Python scripts are actually implemented in native Scala underneath.
  • One noticeable exception: UDFs (User Defined Functions) implemented in Python!
  • One can very well still use pyspark but write the UDFs in Scala (see the sketch after this list).
• Repartitioning
  • A redistribution of a dataset across the cluster is hardly achievable ... and not necessarily desirable.
• Advanced ES queries
  • The ES-Hadoop connector can only submit “simple” requests to ES, with filtering (now)
  • Advanced features such as aggregation queries cannot be used
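A sketch of the "pyspark pipeline, JVM UDF" pattern mentioned above. com.example.udf.Normalize is a hypothetical Scala/Java UDF1 class packaged in a jar on the classpath; registered this way, the UDF executes inside the Spark JVM, so no round-trip to a Python worker process is needed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("jvm-udf-demo").getOrCreate()

# Spark 2.3+; older versions expose sqlContext.registerJavaFunction instead.
# "com.example.udf.Normalize" is a hypothetical class name.
spark.udf.registerJavaFunction("normalize", "com.example.udf.Normalize",
                               StringType())

spark.sql("SELECT normalize('Some Value') AS n").show()
```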
26. Why cool ? (1/5)
Spark’s API is brilliant for our use cases (NetGuardians)
Pattern Based Intelligence
• Implementing our rules in pyspark is straightforward (toy sketch below)
• We are now considering DRESS on Spark Streaming
Profiling
• Out of the box with Spark’s statistics functions
• Here as well, we are considering Spark Streaming for event scoring
Machine Learning
• We prototype with Python scikit-learn
• Implementation on Spark is easy with Spark MLlib
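A toy sketch only (not an actual NetGuardians rule): a "pattern based" rule expressed as a plain DataFrame filter, plus out-of-the-box statistics functions for profiling. Column names, values and the threshold are invented:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rules-demo").getOrCreate()

tx = spark.createDataFrame(
    [("alice", 120.0), ("bob", 9800.0), ("alice", 40.0)],
    ["user", "amount"])

# Rule: flag unusually large transactions (invented threshold)
flagged = tx.filter(F.col("amount") > 5000.0)

# Profiling: per-user statistics with built-in functions
profile = tx.groupBy("user").agg(F.mean("amount").alias("mean"),
                                 F.stddev("amount").alias("stddev"))
flagged.show()
profile.show()
```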
27. Why cool ? (2/5)
What do we want? Initial situation.
28. Why cool ? (2/5)
What do we want? Working with a small subset of the data.
29. Why cool ? (2/5)
What do we want? Working with a full month of data.
30. Why cool ? (2/5)
What do we want? Working with the whole dataset.
31. Why cool ? (3/5)
Processing Distribution scaling linearly with Data Distribution
Works out of the box with:
• Dynamic Allocation in Spark + Mesos
• ES-Hadoop / ES-Spark connector data-locality optimization
32. Why cool ? (4/5)
Processing Distribution scaling linearly with Data Distribution
ES / Spark / Mesos provide the basic building blocks to distribute and scale the processing exactly how we want:
• ES-Hadoop : data-locality optimization
• Mesos / Spark : spark.cores.max=X configuration
• ElasticSearch : search_shards API
Golden Rule : use spark.cores.max = number of shards (sketch below)
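A hedged sketch of that golden rule: derive spark.cores.max from the shard count reported by ElasticSearch's search_shards API, here via the official elasticsearch-py client. Host and index name are placeholders:

```python
from elasticsearch import Elasticsearch
from pyspark import SparkConf

es = Elasticsearch(["http://localhost:9200"])   # placeholder host

# "shards" holds one entry per shard group (a primary plus its replicas)
n_shards = len(es.search_shards(index="transactions")["shards"])

# One core per shard, per the golden rule
conf = SparkConf().set("spark.cores.max", str(n_shards))
```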
33. Why cool ? (5/5)
“One ring to rule them all ...”
• ES, Spark and Mesos are designed to run on large clusters
• But they also work very well on one single fat machine with tons of CPUs and RAM
• We deploy the same platform in tier 1 banks and in small banks.
Disclaimer
I am not going to give an intro to big data; I assume the audience is familiar with Big Data concepts (moving processing to the data nodes, distribution in the form of partitioning and replication, etc.)
I will focus on what is specific to the technology stack at NetGuardians.
Our intelligent software platform gives you a greater capability to detect emerging insider fraud and risk threats, delivering ROI straight away.
OK
!! Prepare the explanation !!!
27 -> 20 !!!!
A “slightly” shorter NetGuardians presentation
For more detail, go and look at our website!!
Vocabulary: big data = distribution = partitioning (sharding) and replication!
The three super-text-heavy slides!!
“You will get the slides ....”
Just prepare a summary .....