Modernizing analytics data pipelines to get the most out of your data while optimizing costs can be challenging. However, cloud providers today offer a solid set of services that can help with this endeavor. In this hands-on session we will tour several GCP services, using Dataflow (Apache Beam) as the backbone to architect a modern analytics pipeline that wires them all together.
2. Who am I?
Mariano Gonzalez
Data/Platform Architect at Otus
Mariano is an engineer with more than 15 years of experience with the JVM. He enjoys working with and exploring a variety of big data technologies. He is an avid open-source contributor.
Most importantly, I am just a person trying to learn about and share big data technologies and approaches.
3. Agenda
● Goal for this session
● Overview of GCP services
● Apache Beam and GCP Dataflow
● Natural Language Processing for sentiment analysis
● Demo ETL/Analytics
● Q&A
4. Goal for this Session
Find an elegant way to build and deploy data/analytics pipelines that:
● Support multiple workloads
● Scale compute and storage independently
● Are backed by managed services
● Are cost-effective
6. Overview of GCP services - App Engine
● Good alternative if K8s infrastructure is not in place
● Easy deployment
○ Similar to AWS SAM from a CLI perspective
○ Similar to AWS Elastic Beanstalk from a deployment perspective
● Well integrated with other cloud services
○ GCP Docker registry
● Multiple Runtimes
○ Custom (Docker)
○ JVM/Node/Python
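As a reference point for "easy deployment": once an app.yaml descriptor exists, shipping a new version is a single command (the project and version names below are illustrative, not from the demo):
$ gcloud app deploy app.yaml
--project=my-project
--version=v1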
7. Overview of GCP services - Storage
● Hot - durable, highly available, high-performance object storage for frequently accessed data
○ Amazon S3 Standard
○ Microsoft Azure Hot Blob Storage
○ Google Cloud Storage standard
● Cool - storage class for data that is accessed less frequently, but requires rapid access when needed
○ Amazon S3 Standard-IA and S3 One Zone-IA
○ Microsoft Azure Cool Blob Storage
○ Google Cloud Storage Nearline
● Cold - secure, durable, and low-cost storage service for data archiving
○ Amazon S3 Glacier
○ Microsoft Azure Blob Archive Storage
○ Google Cloud Storage Coldline
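For reference, on GCP the storage class is chosen per bucket at creation time; for example, with gsutil (the bucket name is illustrative):
$ gsutil mb -c nearline -l us-east1 gs://my-analytics-archive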
8. Overview of GCP services - Pubsub
Why not just use Kafka?
● Fully managed services
○ Both systems have fully managed versions in the cloud
● Cloud vs. on-prem
○ Pub/Sub is only offered as part of the GCP ecosystem, whereas Apache Kafka can be used both as a cloud service and on-prem
● Message duplication
○ Kafka manages offsets via ZooKeeper
○ Pub/Sub works by acknowledging each message
9. Overview of GCP services - Pubsub
Why not just use Kafka?
● Retention policy
○ Both Kafka and Pub/Sub have options to configure the maximum retention time
● Consumer groups vs. subscriptions
○ Pub/Sub uses subscriptions: you create a subscription and then start reading messages from it (see the subscriber sketch below)
○ Kafka uses the concepts of "consumer groups" and "partitions"
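To make the subscription model concrete, here is a minimal subscriber sketch using the Google Cloud Pub/Sub Java client; the project and subscription names are placeholders. Note that each message is acknowledged individually rather than tracked through offsets as in Kafka.

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class SubscribeExample {
  public static void main(String[] args) {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription");

    // Each message is acked individually; there are no offsets to manage.
    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      System.out.println("Received: " + message.getData().toStringUtf8());
      consumer.ack();
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
    subscriber.awaitTerminated(); // Block and keep pulling messages.
  }
}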
10. Overview of GCP services - BigQuery
● Query engines are probably one of the most contested service categories today:
○ Snowflake
○ Presto
○ Redshift
● How are these warehouses different?
11. ● Presto
○ Self-hosted, open-source solution
● Pre-RA3 Redshift
○ Somewhat more fully managed, but still requires the user to configure individual compute clusters with a fixed amount of memory, compute, and storage
● Redshift RA3
○ Closer to the user experience of Snowflake by separating compute from storage
● Snowflake
○ The user only configures the size and number of compute clusters
○ Every compute cluster sees the same data
○ Compute clusters can be created and removed in seconds
Overview of GCP services - BigQuery
12. BigQuery
● Flat-rate is similar to Snowflake, except there is no concept of a compute cluster, just a configurable number of "compute slots"
● On-demand is a pure serverless model, where the user submits queries one at a time and pays per query
● On-demand mode can be much more expensive, or much cheaper, depending on the nature of your workload
A "steady" workload that utilizes your compute capacity 24/7 will be much cheaper in flat-rate mode. A "spiky" workload with periodic large queries spaced by long periods of idleness or low utilization will be much cheaper in on-demand mode.
Overview of GCP services - BigQuery
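As a concrete reference, a query can be submitted with a few lines of the BigQuery Java client; the public Shakespeare sample dataset below is just an illustration. Under the on-demand model, each such query is billed by bytes scanned.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryExample {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration config = QueryJobConfiguration
        .newBuilder("SELECT word, SUM(word_count) AS total "
            + "FROM `bigquery-public-data.samples.shakespeare` "
            + "GROUP BY word ORDER BY total DESC LIMIT 10")
        .build();
    // Runs the query and waits for the result.
    TableResult result = bigquery.query(config);
    result.iterateAll().forEach(row ->
        System.out.println(row.get("word").getStringValue()));
  }
}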
13. What is Google Cloud Dataflow?
● Data processing service for both:
○ batch
○ real-time data streaming applications
● Benefits
○ Enables developers to set up analytic pipelines immediately
● Next-gen MapReduce
○ Designed to bring to entire analytics pipelines the style of fast parallel execution that MapReduce brought to a single type of batch-processing computation
○ It's based partly on MillWheel and Flume (two Google-developed systems for data ingestion and low-latency processing)
Overview of GCP services - Dataflow
14. Apache Beam SDK and Dataflow Runner
Google Cloud Dataflow overlaps with services such as:
● Amazon Kinesis
● Apache Storm
● Apache Spark
● Facebook Flux
$ java -jar build/libs/transformation-1.0-all.jar
--project=ccc-2020-289323
--runner=DataflowRunner
--streaming=true
--region=us-east1
--tempLocation=gs://chicago-cloud-conference-2020/temp/
--stagingLocation=gs://chicago-cloud-conference-2020/jars/
--filesToStage=build/libs/transformation-1.0-all.jar
--maxNumWorkers=2
--numWorkers=1
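For orientation, the jar being launched above is an ordinary Apache Beam pipeline. The sketch below shows roughly what such a streaming pipeline looks like in the Beam Java SDK; the topic, bucket, and transform logic are illustrative placeholders, not the demo's actual code (see the repository linked at the end).

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class TransformationPipeline {
  public static void main(String[] args) {
    // Flags such as --runner=DataflowRunner and --project are parsed from args.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("ReadTweets", PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/tweets"))
        .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("Cleanup", MapElements.into(TypeDescriptors.strings())
            .via((String json) -> json.trim()))
        .apply("WriteRaw", TextIO.write()
            .to("gs://my-bucket/output/tweets")
            .withWindowedWrites()
            .withNumShards(1)); // Unbounded writes need windowing and explicit shards.

    pipeline.run();
  }
}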
16. Overview of GCP services - Dataproc
On-demand Hadoop cluster
● Of the three managed services for Hadoop clusters (Amazon EMR, Azure HDInsight, Dataproc), Dataproc is the fastest to provision
● Easy runtime customization via pip commands
● Not as well integrated with third-party services (Azure HDInsight - Databricks, Amazon EMR - Apache Zeppelin)
$ gcloud beta dataproc clusters create cluster-name
--optional-components=ANACONDA,JUPYTER
--image-version=1.4
--enable-component-gateway
--bucket=chicago-cloud-conference-2020
--region=us-east1
--project=ccc-2020-289323
--metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage numpy pandas matplotlib'
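Once the cluster is up, work can be submitted just as declaratively; a hypothetical example (the script path is illustrative):
$ gcloud dataproc jobs submit pyspark gs://chicago-cloud-conference-2020/jobs/analysis.py
--cluster=cluster-name
--region=us-east1
--project=ccc-2020-289323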
17. Overview of GCP services - Cloud Natural Language API
● What can we do with the Cloud Natural Language API?
○ Reveal the structure and meaning of text via machine learning models
○ Extract information about people, places, and events mentioned in text documents, news articles, or blog posts
○ Understand sentiment about a product on social media, or parse intent from customer conversations happening in a call center or a messaging app
● How can we use it?
○ Analyze text uploaded as part of an HTTP request
○ Integrate with Google Cloud Storage
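A minimal sketch of calling sentiment analysis from the Java client (the sample sentence is arbitrary):

import com.google.cloud.language.v1.Document;
import com.google.cloud.language.v1.LanguageServiceClient;
import com.google.cloud.language.v1.Sentiment;

public class SentimentExample {
  public static void main(String[] args) throws Exception {
    try (LanguageServiceClient client = LanguageServiceClient.create()) {
      Document doc = Document.newBuilder()
          .setContent("I love this conference, the talks are great!")
          .setType(Document.Type.PLAIN_TEXT)
          .build();
      // Returns a document-level score and magnitude (explained on the next slide).
      Sentiment sentiment = client.analyzeSentiment(doc).getDocumentSentiment();
      System.out.printf("score=%.1f magnitude=%.1f%n",
          sentiment.getScore(), sentiment.getMagnitude());
    }
  }
}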
18. NLP - Sentiment Analysis
Two types of metrics to consider:
1. Score
a. Ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text
2. Magnitude
a. Indicates the overall intensity of emotion (both positive and negative) in a given text, between 0.0 and +inf
b. Magnitude is not normalized; each expression of emotion in the text (both positive and negative) contributes to the value
Sample sentiment values:
Positive: score 0.8, magnitude 3.0
Negative: score -0.6, magnitude 4.0
Neutral: score 0.1, magnitude 0.0
Mixed: score 0.0, magnitude 4.0
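One way to read the two values together, following the sample rows above (the cutoffs here are illustrative assumptions, not part of the API):

class SentimentLabel {
  // Rough classification combining score and magnitude; thresholds are arbitrary.
  static String classify(float score, float magnitude) {
    if (score > 0.25f) return "positive";
    if (score < -0.25f) return "negative";
    // Near-zero score with high magnitude suggests strongly mixed emotions;
    // near-zero score with low magnitude suggests genuinely neutral text.
    return magnitude > 1.0f ? "mixed" : "neutral";
  }
}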
19. Demo - ETL
• Extract – different sources (Twitter in this case)
• Transform – Cleanup and data presentation
• Load – Columnar format
https://github.com/eschizoid/ccc-2020
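The load step targets BigQuery's columnar storage. A hedged sketch of what such a write can look like with Beam's BigQueryIO — the table name and schema are illustrative, and the repository above contains the actual pipeline:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

class LoadStep {
  // rows: enriched tweets produced by the transform step (hypothetical shape).
  static void load(PCollection<TableRow> rows) {
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("text").setType("STRING"),
        new TableFieldSchema().setName("score").setType("FLOAT"),
        new TableFieldSchema().setName("magnitude").setType("FLOAT")));

    rows.apply("LoadToBigQuery", BigQueryIO.writeTableRows()
        .to("ccc-2020-289323:analytics.tweet_sentiment")
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
  }
}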