Lowering the entry point to getting going with Hadoop and obtaining business ... - DataWorks Summit
SAS is a leader in advanced analytics with over 40 years of experience. They provide tools to manage, explore, develop models, and deploy analytics from, with, and within Hadoop. This allows customers to realize value from Hadoop throughout the entire analytics lifecycle. SAS helps address challenges like Hadoop skills shortages and tools not being optimized for big data. They demonstrated identifying reasons for abandoned shopping carts using Hadoop and SAS analytics tools.
How to get started in Big Data without Big Costs - StampedeCon 2016 - StampedeCon
Looking to implement Hadoop but haven’t pulled the trigger yet? You are not alone. Many companies have heard the hype about how Hadoop can solve the challenges presented by big data, but few have actually implemented it. What’s preventing them from taking the plunge? Can it be done in small steps to ensure project success?
This session will discuss some of the items to consider when getting started with Hadoop and how to go about making the decision to move to the de facto big data platform. Starting small can be a good approach when your company is learning the basics and deciding what direction to take. There is no need to invest large amounts of time and money up front if a proof of concept is all you aim to provide. Using well-known data sets on virtual machines provides a low-cost, low-effort way to find out whether your big data journey with Hadoop will be successful.
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016 - StampedeCon
Hadoop adoption is a journey. Depending on the business, the process can take weeks, months, or even years. Hadoop is a transformative technology, so the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Companies that have lived with an application-driven business for the last two decades face real challenges in suddenly becoming data-driven. They need to begin thinking less in terms of single, siloed servers and more about “the cluster”.
The concept of the cluster becomes the center of data gravity, drawing all the applications to it. Companies, especially their IT organizations, embark on a process of understanding how to maintain and operationalize this environment and provide the data lake as a service to the business. They must empower the business by providing the resources for the use cases which drive both renovation and innovation. IT needs to adopt new technologies and new methodologies which enable the solutions. This is not technology for technology's sake. Hadoop is a data platform servicing and enabling all facets of an organization. Building out and expanding this platform is the ongoing journey as word gets out to businesses that they can have any data they want at any time. Success is what drives the journey.
The length of the journey varies from company to company. Sometimes the challenges are based on the size of the company but many times the challenges are based on the difficulty of unseating established IT processes companies have adopted without forethought for the past two decades. Companies must navigate through the noise. Sifting through the noise to find those solutions which bring real value takes time. As the platform matures and becomes mainstream, more and more companies are finding it easier to adopt Hadoop. Hundreds of companies have already taken many steps; hundreds more have already taken the first step. As the wave of successful Hadoop adoption continues, more and more companies will see the value in starting the journey and paving the way for others.
This document discusses strategies for successfully utilizing a data lake. It notes that creating a data lake is just the beginning and that challenges include data governance, metadata management, access, and effective use of the data. The document advocates for data democratization through discovery, accessibility, and usability. It also discusses best practices like self-service BI and automated workload migration from data warehouses to reduce costs and risks. The key is to address the "data lake dilemma" of these challenges to avoid a "data swamp" and slow adoption.
1) Before Netezza, data analysis took a long time due to large datasets and restrictions. Productivity was low.
2) SAS software integrates with Netezza to enable faster analytics on large datasets without constraints.
3) The integration allows scoring algorithms and transforms to run directly on the Netezza database, improving performance and reducing data movement compared to traditional architectures.
This document provides information about Aetna, a health insurance company. It summarizes that Aetna serves about 46 million customers, helping them make healthcare decisions and manage healthcare spending. Aetna offers various medical, pharmacy, dental, life, and disability insurance plans as well as Medicaid services and behavioral health programs. As of March 2015, Aetna had approximately 23.7 million medical members, 15.5 million dental members, and 15.4 million pharmacy members. Aetna works with over 1.1 million healthcare professionals, including more than 674,000 primary care doctors and specialists, and 5,589 hospitals across the US and globally.
SAS and Netezza Enzee universe presentation_20_june2011 - Pavel Zhivulin
SAS is a leader in analytics software with over $2 billion in annual revenue. It has partnerships with Netezza to integrate SAS analytics capabilities with Netezza's high performance data warehouse appliances. Key products of this partnership include SAS Access to Netezza for optimized data transfer and SAS Scoring Accelerator for Netezza which deploys SAS predictive models directly in the Netezza database for faster, more scalable scoring.
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... - NoSQLmatters
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
Ambari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with Ambari - Hortonworks
Teradata Viewpoint provides a unified monitoring solution for Teradata Database, Aster, and Hadoop. It integrates with Ambari to simplify monitoring Hadoop. Viewpoint uses Ambari's REST APIs to collect metrics and alerts from Hadoop and store them in a database for trend analysis and visualization. This allows Viewpoint to deliver comprehensive Hadoop monitoring without having to understand its various monitoring technologies.
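The abstract doesn't include the actual calls, but the pattern it describes (polling Ambari over REST and persisting the results) can be sketched in a few lines of Python. The host, cluster name, and credentials below are placeholders, and the exact fields available vary by Ambari version.

```python
import requests

AMBARI = "http://ambari.example.com:8080"   # placeholder host
CLUSTER = "mycluster"                        # placeholder cluster name
AUTH = ("admin", "admin")                    # placeholder credentials

def get(path, params=None):
    """GET a resource from the Ambari REST API and return parsed JSON."""
    r = requests.get(f"{AMBARI}/api/v1{path}", auth=AUTH, params=params, timeout=10)
    r.raise_for_status()
    return r.json()

# List each service in the cluster with its current state.
services = get(f"/clusters/{CLUSTER}/services", params={"fields": "ServiceInfo/state"})
for item in services["items"]:
    info = item["ServiceInfo"]
    print(info["service_name"], info["state"])

# Pull current alert instances, which a monitor like Viewpoint could
# persist in a database for trend analysis and visualization.
alerts = get(f"/clusters/{CLUSTER}/alerts", params={"fields": "Alert/state,Alert/text"})
for item in alerts["items"]:
    print(item["Alert"]["state"], item["Alert"]["text"])
```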
Turn Data Into Actionable Insights - StampedeCon 2016 - StampedeCon
At Monsanto, emerging technologies such as IoT, advanced imaging, and geospatial platforms, along with molecular breeding, ancestry, and genomics data sets, have made us rethink how we approach developing, deploying, scaling, and distributing our software to accelerate predictive and prescriptive decisions. We created a cloud-based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled a collaborative discovery analytics environment for data science teams to perform model development; provisioned data through APIs and streams; deployed models to production through our auto-scaling big data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical, and batch analytics@scale; and integrated analytics with our core product platforms to turn data into actionable insights.
10 Amazing Things To Do With a Hadoop-Based Data Lake - VMware Tanzu
Greg Chase, Director, Product Marketing, presents "Big Data: 10 Amazing Things to do With a Hadoop-based Data Lake" at the Strata Conference + Hadoop World 2014 in NYC.
Learn about SAS and Cloudera technical integration, how SAS builds on the enterprise data hub, and SAS In-Memory Statistics for Hadoop, machine learning capabilities.
Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016 - StampedeCon
This document discusses building a data pipeline using tools from the Apache Hadoop ecosystem. It begins with an introduction to the speaker and why Hadoop is useful for data pipelines. It then provides a matrix comparing the different Hadoop distributions and their included components. It outlines the various tiers of projects in the Hadoop ecosystem, noting that the list is not exhaustive. It also presents the typical data lifecycle of capture, enrichment, analysis, presentation, reporting, archival, and removal. The document concludes with a reference to demo code and a call for questions.
This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
Hadoop-based data lakes have become increasingly popular within today’s modern data architectures for their scalability, ability to handle data variety, and low cost. Many organizations start slowly with their data lake initiatives, but as the lakes grow, they struggle with data consistency, quality, and security, and lose confidence in their data lake initiatives.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes, their relationship with productivity, and how they help organizations meet regulatory and compliance requirements. The talk advocates adopting a different mindset for designing and implementing flexible governance mechanisms on Hadoop data lakes.
This document discusses deploying a governed data lake using Hadoop and Waterline Data Inventory. It begins by outlining the benefits of a data lake and differences between data lakes and data warehouses. It then discusses using Hadoop as the platform for the data lake and some challenges around governance, scale, and usability. The document proposes a three phase approach using Waterline Data Inventory to organize, inventory, and open up the data lake. It provides screenshots and descriptions of Waterline's key capabilities like metadata discovery, data profiling, sensitive data identification, governance tools, and self-service catalog. It also includes an overview of Waterline Data as a company.
This document discusses architecting Hadoop for adoption and data applications. It begins by explaining how traditional systems struggle as data volumes increase and how Hadoop can help address this issue. Potential Hadoop use cases are presented such as file archiving, data analytics, and ETL offloading. Total cost of ownership (TCO) is discussed for each use case. The document then covers important considerations for deploying Hadoop such as hardware selection, team structure, and impact across the organization. Lastly, it discusses lessons learned and the need for self-service tools going forward.
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea... - DataWorks Summit
The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop - Eric Sun
Teradata Connectors for Hadoop enable high-volume data movement between Teradata and Hadoop platforms. LinkedIn conducted a proof-of-concept using the connectors for use cases like copying clickstream data from Hadoop to Teradata for analytics and publishing dimension tables from Teradata to Hadoop for machine learning. The connectors help address challenges of scalability and tight processing windows for these large-scale data transfers.
Hortonworks Oracle Big Data Integration - Hortonworks
Slides from joint Hortonworks and Oracle webinar on November 11, 2014. Covers the Modern Data Architecture with Apache Hadoop and Oracle Data Integration products.
The document provides an agenda and overview of SAP Vora 1.4. It discusses SAP Vora's role in big data and data lakes, how it addresses challenges with big data, and its usage patterns across different industries like financial services, telecommunications, oil and gas, retail, and manufacturing. Key points include that SAP Vora leverages Hadoop and Spark for scalable and affordable big data storage and processing, provides a unified access layer and simplified data modeling for different data sources, and seamlessly integrates with SAP HANA for enterprise-grade analytics.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... - StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
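As a concrete (and invented, not Shutterstock's) illustration of that Avro-over-Kafka pattern, the sketch below serializes a record against a fixed schema with fastavro and publishes it with kafka-python; in the platform described above the schema would come from the global schema registry rather than being hard-coded.

```python
import io

from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer  # pip install kafka-python fastavro

# Hypothetical event schema; the platform described above would fetch
# this from a global schema registry instead of embedding it here.
SCHEMA = parse_schema({
    "type": "record",
    "name": "PageView",
    "namespace": "example.events",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "ts_ms", "type": "long"},
    ],
})

def encode(record: dict) -> bytes:
    """Serialize a record to compact Avro binary (schema not embedded)."""
    buf = io.BytesIO()
    schemaless_writer(buf, SCHEMA, record)
    return buf.getvalue()

producer = KafkaProducer(bootstrap_servers="broker.example.com:9092")
producer.send("page-views", encode({
    "user_id": "u42",
    "url": "/search?q=hadoop",
    "ts_ms": 1470000000000,
}))
producer.flush()
```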
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
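The Hive-based conversion mentioned above amounts to a CREATE TABLE ... AS SELECT into a columnar format. Here is a minimal sketch using PyHive; the host, database, and table names are placeholders, and ORC is interchangeable with Parquet (STORED AS PARQUET) in the DDL.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Placeholder connection details for a HiveServer2 endpoint.
conn = hive.Connection(host="hive.example.com", port=10000, database="events")
cur = conn.cursor()

# Rewrite an existing row-oriented table into compressed ORC. The
# SELECT runs through Hive's normal execution engine, so the format
# conversion parallelizes across the cluster like any other query.
cur.execute("""
    CREATE TABLE page_views_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY')
    AS SELECT * FROM page_views_raw
""")
```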
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Innovation in the Data Warehouse - StampedeCon 2016 - StampedeCon
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we’ll dive into the decision points necessary for your own cluster, for example: cloud vs. on premises, physical vs. virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we’ll share some lessons learned about the pieces of our architecture that worked well and rant about those which didn’t. No deep Hadoop knowledge is necessary; the talk is aimed at the architect or executive level.
Operational Analytics Using Spark and NoSQL Data Stores - DATAVERSITY
NoSQL data stores have emerged for scalable capture and real-time analysis of data. Apache Spark and Hadoop provide additional scalable analytics processing. This session looks at these technologies and how they can be used to support operational analytics to improve operational effectiveness. It also looks at an example of how operational analytics can be implemented in NoSQL environments using the Basho Data Platform with Apache Spark (a generic PySpark sketch follows the list below):
•The emergence of NoSQL, Hadoop and Apache Spark
•NoSQL Use Cases
•The need for operational analytics
•Types of operational analysis
•Key requirements for operational analytics
•Operational analytics using the Basho Data Platform with Apache Spark.
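The Basho connector's API is not shown in this abstract, so the sketch below uses plain PySpark as a stand-in to show the shape of such an operational-analytics job: aggregate recent events (here read from an invented HDFS export path) into a per-service error rate.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("operational-analytics").getOrCreate()

# Hypothetical export of recent request events from a NoSQL store.
events = spark.read.json("hdfs:///exports/requests/latest/")

# A typical operational metric: error rate per service.
error_rates = (
    events.groupBy("service")
          .agg(F.count("*").alias("requests"),
               F.sum(F.when(F.col("status") >= 500, 1).otherwise(0)).alias("errors"))
          .withColumn("error_rate", F.col("errors") / F.col("requests"))
)
error_rates.orderBy(F.desc("error_rate")).show()
```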
The document discusses Teradata's portfolio for Hadoop, including the Teradata Aster Big Analytics Appliance, the Teradata Appliance for Hadoop, a commodity offering with Dell, and support for the Hortonworks Data Platform. It provides consulting, training, support, and managed services for Hadoop. Teradata SQL-H gives business users standard SQL access to data stored in Hadoop through Teradata, allowing queries to run quickly on Teradata while accessing data from Hadoop efficiently through HCatalog.
Expand a Data warehouse with Hadoop and Big Data - jdijcks
After investing years in the data warehouse, are you now supposed to start over? Nope. This session discusses how to leverage Hadoop and big data technologies to augment the data warehouse with new data, new capabilities and new business models.
This document discusses navigating user data management and data discovery. It provides an overview of evaluating and selecting data management tools for a Hadoop data lake. Key criteria for evaluation include metadata curation, lineage and versioning, integration capabilities, and performance. Several vendors were evaluated, with Global ID, Attivio, and Waterline Data scoring highest based on the criteria. The presentation emphasizes selecting a limited number of tools based on business and user requirements.
This document discusses how data science and AI are fueling new business models driven by data. It summarizes that (1) connected devices, customers, and sensors are generating massive amounts of data across manufacturing, distribution, marketing, sales, and service; (2) technologies like cloud computing, streaming data, IoT, and machine learning are enabling new ways to harness this data; and (3) a modern data architecture is needed to encompass all data sources, enable analytics and machine learning, and power actionable intelligence across edge, cloud, and on-premises environments.
The document provides an overview of Apache Hadoop and how it addresses challenges with traditional data architectures. It discusses how Hadoop uses HDFS for distributed storage and YARN as a data operating system to allow for distributed computing. It also summarizes different data access methods in Hadoop including MapReduce for batch processing and how the Hadoop ecosystem continues to evolve and include technologies like Spark, Hive and HBase.
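To make the batch model concrete, here is the textbook word count written for Hadoop Streaming, which lets MapReduce drive a plain Python script as both mapper and reducer (an illustration, not taken from the document; paths in the comment are placeholders).

```python
#!/usr/bin/env python
"""Word count for Hadoop Streaming: run with 'map' or 'reduce' as argv[1].

Submit with (placeholder paths):
  hadoop jar hadoop-streaming.jar \
    -input /data/text -output /data/counts \
    -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' \
    -file wordcount.py
"""
import sys

def do_map():
    # Emit one (word, 1) pair per word; the framework sorts pairs by
    # key between the map and reduce phases.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def do_reduce():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    do_map() if sys.argv[1] == "map" else do_reduce()
```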
Managing Enterprise Hadoop Clusters with Apache Ambari - Hortonworks
This document discusses Apache Ambari, an open-source platform for managing Hadoop clusters. It provides an overview of Ambari, describing its key features including stacks, blueprints, views and extensibility points. It also demonstrates Ambari's capabilities for cluster deployment, management, monitoring and upgrades through a stack and blueprint demo.
This document discusses organizing data in a data lake or "data reservoir". It describes the changing data landscape with multiple platforms for different analytical workloads. It outlines issues with the current siloed approach to data integration and management. The document introduces the concept of a data reservoir - a collaborative, governed environment for rapidly producing information. Key capabilities of a data reservoir include data collection, classification, governance, refinery, consumption, and virtualization. It describes how a data reservoir uses zones to organize data at different stages and uses workflows and an information catalog to manage the information production process across the reservoir.
This document discusses new features in SAP HANA SPS 10 for Hadoop and Spark integration, including a native Spark SQL integration using a Spark adapter, Ambari integration with the HANA cockpit for unified administration of HANA and Hadoop nodes, and data lifecycle management between HANA and Hadoop using a relocation agent. It also provides steps for configuring the Spark controller and details the Ambari integration with the HANA cockpit.
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303) - Amazon Web Services
This document discusses building a big data analytics data lake. It begins with an overview of what a data lake is and the benefits it provides like quick data ingestion without schemas and storing all data in one centralized location. It then discusses important capabilities like ingestion, storage, cataloging, search, security and access controls. The document provides an example of how biotech company AMGEN built their own data lake on AWS. It concludes with a demonstration of an AWS data lake solution package that can be deployed via CloudFormation to build an initial data lake.
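The session's own templates aren't reproduced here, but the ingest-then-catalog loop it describes can be sketched with boto3 against today's S3 and Glue APIs (which postdate parts of this 2016 talk); the bucket, database, and column names are invented.

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

BUCKET = "example-data-lake"  # placeholder bucket

# 1) Ingest: land a raw file in the lake without any upfront schema.
s3.upload_file("clicks-2016-11-01.json", BUCKET,
               "raw/clicks/dt=2016-11-01/part-0.json")

# 2) Catalog: register the dataset so query engines (Athena, EMR) can find it.
glue.create_database(DatabaseInput={"Name": "lake_raw"})
glue.create_table(
    DatabaseName="lake_raw",
    TableInput={
        "Name": "clicks",
        "StorageDescriptor": {
            "Columns": [{"Name": "user_id", "Type": "string"},
                        {"Name": "url", "Type": "string"}],
            "Location": f"s3://{BUCKET}/raw/clicks/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    },
)
```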
This document provides an overview of Apache Atlas and how it addresses big data governance issues for enterprises. It discusses how Atlas provides a centralized metadata repository that allows users to understand data across Hadoop components. It also describes how Atlas integrates with Apache Ranger to enable dynamic security policies based on metadata tags. Finally, it outlines new capabilities in upcoming Atlas releases, including cross-component data lineage tracking and a business taxonomy/catalog.
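Atlas exposes that centralized repository over REST. As a hedged sketch using the Atlas v2 API's basic search (placeholder host and credentials), a client can look up every entity carrying a given classification tag, the same tags Ranger policies can key on:

```python
import requests

ATLAS = "http://atlas.example.com:21000"  # placeholder host
AUTH = ("admin", "admin")                 # placeholder credentials

# Basic search: find entities carrying the PII classification tag.
resp = requests.post(
    f"{ATLAS}/api/atlas/v2/search/basic",
    auth=AUTH,
    json={"classification": "PII", "limit": 25},
    timeout=10,
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["typeName"], entity["attributes"].get("qualifiedName"))
```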
Presentation big dataappliance-overview_oow_v3x - KinAnx
The document outlines Oracle's Big Data Appliance product. It discusses how businesses can use big data to gain insights and make better decisions. It then provides an overview of big data technologies like Hadoop and NoSQL databases. The rest of the document details the hardware, software, and applications that come pre-installed on Oracle's Big Data Appliance - including Hadoop, Oracle NoSQL Database, Oracle Data Integrator, and tools for loading and analyzing data. The summary states that the Big Data Appliance provides a complete, optimized solution for storing and analyzing less structured data, and integrates with Oracle Exadata for combined analysis of all data sources.
Hadoop and the Data Warehouse: Point/Counter Point - Inside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Overview of Apache Trafodion (incubating), an enterprise-class transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world-class RDBMS, some performance information, and the new company Esgyn, which will leverage Apache Trafodion for operational solutions.
Enterprise Hadoop is Here to Stay: Plan Your Evolution Strategy - Inside Analysis
The Briefing Room with Neil Raden and Teradata
Live Webcast on August 19, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=1acd0b7ace309f765dc3196001d26a5e
Modern enterprises have been able to solve information management woes with the data warehouse, now a staple across the IT landscape that has evolved to a high level of sophistication and maturity with thousands of global implementations. Today’s modern enterprise has a similar challenge; big data and the fast evolution of the Hadoop ecosystem create plenty of new opportunities but also a significant number of operational pains as new solutions emerge.
Register for this episode of The Briefing Room to hear veteran Analyst Neil Raden as he explores the details and nature of Hadoop’s evolution. He’ll be briefed by Cesar Rojas of Teradata, who will share how Teradata solves some of the Hadoop operational challenges. He will also explain how the integration between Hadoop and the data warehouse can help organizations develop a more responsive and robust data management environment.
Visit InsideAnalysis.com for more information.
This document discusses Hortonworks and its mission to enable modern data architectures through Apache Hadoop. It provides details on Hortonworks' commitment to open source development through Apache, engineering Hadoop for enterprise use, and integrating Hadoop with existing technologies. The document outlines Hortonworks' services and the Hortonworks Data Platform (HDP) for storage, processing, and management of data in Hadoop. It also discusses Hortonworks' contributions to Apache Hadoop and related projects as well as enhancing SQL capabilities and performance in Apache Hive.
Today's organizations contend with more diverse applications, data, and systems than ever before – silos that are often fragmented and difficult to leverage together. iWay Big Data Integrator (BDI) simplifies the creation, management, and use of Hadoop-based data lakes. It provides a modern, native approach to Hadoop-based data integration and management that ensures high levels of capability, compatibility, and flexibility to help your organization.
Join us to learn how you can simplify adoption of Apache Hadoop using iWay Big Data Integrator. Learn about our ability to streamline the deployment of ingestion, transformation, and extraction tasks.
See the pre-recorded webcast online at: http://www.informationbuilders.com/webevents/online/24427#sthash.J0cRy1PG.dpuf
The document summarizes Oracle's Big Data Appliance and solutions. It discusses the Big Data Appliance hardware which includes 18 servers with 48GB memory, 12 Intel cores, and 24TB storage per node. The software includes Oracle Linux, Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator, and Oracle Loader for Hadoop. Oracle Loader for Hadoop can be used to load data from Hadoop into Oracle Database in online or offline mode. The Big Data Appliance provides an optimized platform for storing and analyzing large amounts of data and is integrated with Oracle Exadata.
The document discusses Oracle's data integration products and big data solutions. It outlines five core capabilities of Oracle's data integration platform, including data availability, data movement, data transformation, data governance, and streaming data. It then describes eight core products that address real-time and streaming integration, ELT integration, data preparation, streaming analytics, dataflow ML, metadata management, data quality, and more. The document also outlines five cloud solutions for data integration including data migrations, data warehouse integration, development and test environments, high availability, and heterogeneous cloud. Finally, it discusses pragmatic big data solutions for data ingestion, transformations, governance, connectors, and streaming big data.
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse - Rizaldy Ignacio
Big SQL 3.0 provides a powerful way to run SQL queries on Hadoop data without compromises. It uses a modern MPP architecture instead of MapReduce for high performance. Federation allows Big SQL to access external data sources within a single SQL statement, enabling hybrid data warehouse scenarios.
Hadoop and NoSQL joining forces by Dale Kim of MapR - Data Con LA
More and more organizations are turning to Hadoop and NoSQL to manage big data. In fact, many IT professionals consider each of those terms to be synonymous with big data. At the same time, these two technologies are seen as different beasts that handle different challenges. That means they are often deployed in a rather disjointed way, even when intended to solve the same overarching business problem. The emerging trend of “in-Hadoop databases” promises to narrow the deployment gap between them and enable new enterprise applications. In this talk, Dale will describe that integrated architecture and how customers have deployed it to benefit both the technical and the business teams.
How Oracle has managed to separate the SQL engine from its flagship database to process queries, together with the access drivers that can read data both from files on the Hadoop Distributed File System and from the data warehousing tool, Hive.
Hadoop and SQL: Delivering Analytics Across the Organization - Seeling Cheung
This document summarizes a presentation given by Nicholas Berg of Seagate and Adriana Zubiri of IBM on delivering analytics across organizations using Hadoop and SQL. Some key points discussed include Seagate's plans to use Hadoop to enable deeper analysis of factory and field data, the evolving Hadoop landscape and rise of SQL, and a performance comparison showing IBM's Big SQL outperforming Spark SQL, especially at scale. The document provides an overview of Seagate and IBM's strategies and experiences with Hadoop.
Leveraging SAP HANA with Apache Hadoop and SAP Analytics - Method360
The rise of big data and the Apache Hadoop platform allows for the capture and processing of data at an unprecedented scale and velocity. Watch this slide deck to get a comprehensive overview of the Apache Hadoop platform architecture and learn how to leverage the strengths of both the Apache Hadoop and SAP HANA platforms.
Hadoop and the Data Warehouse: When to Use Which - DataWorks Summit
In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate, since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes, and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use an MPP relational database in a data warehouse, discovery platform, or other workload-specific applications.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing at Teradata, and Jim Walker of Hortonworks, will examine how big data technologies are being used today by practical big data practitioners.
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud - DataWorks Summit
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
Azure Cafe Marketplace with Hortonworks March 31 2016 - Joan Novino
Azure Big Data: “Got Data? Go Modern and Monetize”.
In this session you will learn how the Hortonworks Data Platform (HDP), architected, developed, and built completely in the open, provides an enterprise-ready data platform for adopting a Modern Data Architecture.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers new options to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to:
•Solve big-data problems with Hadoop.
•Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data.
•Implement Hadoop using the HDS Hadoop reference architecture.
For more information on the Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
The document provides an overview of Hadoop, including:
- What Hadoop is and its core modules like HDFS, YARN, and MapReduce (see the WebHDFS sketch after this list).
- Reasons for using Hadoop like its ability to process large datasets faster across clusters and provide predictive analytics.
- When Hadoop should and should not be used, such as for real-time analytics versus large, diverse datasets.
- Options for deploying Hadoop including as a service on cloud platforms, on infrastructure as a service providers, or on-premise with different distributions.
- Components that make up the Hadoop ecosystem like Pig, Hive, HBase, and Mahout.
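As a concrete touchpoint for the HDFS module listed above, the NameNode serves the WebHDFS REST API; this sketch lists a directory over plain HTTP (host, port, and path are placeholders; the default WebHDFS port differs across Hadoop versions).

```python
import requests

NAMENODE = "http://namenode.example.com:50070"  # placeholder host and port

# LISTSTATUS is the WebHDFS equivalent of 'hdfs dfs -ls'.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1/data/raw",
    params={"op": "LISTSTATUS", "user.name": "hdfs"},
    timeout=10,
)
resp.raise_for_status()
for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["type"], status["length"], status["pathSuffix"])
```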
Similar to "Integration of Hadoop in Business landscape", Michal Alexa, IT and Innovation Cloud Solution Center Lead at DataVard (20)
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...Dataconomy Media
The challenges of increasing complexity of organizations, companies and projects are obvious and omnipresent. Everywhere there are connections and dependencies that are often not adequately managed or not considered at all because of a lack of technology or expertise to uncover and leverage the relationships in data and information. In his presentation, Axel Morgner talks about graph technology and knowledge graphs as indispensable building blocks for successful companies.
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...Dataconomy Media
The document discusses emerging technologies and their potential impacts, and questions how individuals and societies can responsibly address issues arising from new technologies. It notes that governments, regulators, and individuals struggle to understand new concepts that spread rapidly. It asks if there are existing systems or forms of cooperation that could help societies address responsibilities related to technologies, but offers no definitive solutions, mainly posing questions.
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...Dataconomy Media
Every day we are challenged with more data, more use cases, and an ever-increasing demand for analytics. In this talk Bjorn will explain how autonomous data management and machine learning help innovators become more productive, and give examples of how to deliver new data-driven projects with less risk at lower cost.
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...Dataconomy Media
This document contains an agenda and presentation materials for a talk on building and deploying an anti-money laundering (AML) model using DataRobot. The agenda includes introductions to DataRobot and AML, an AML demo, a real AML use case example, and a question and answer section. The presentation materials provide background on DataRobot, including its history and products. It also gives an overview of money laundering and how AML works, both traditionally using rule-based systems and how machine learning can help by reducing false positives and improving efficiency. A case study shows how DataRobot has helped other organizations with AML use cases.
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...Dataconomy Media
Trump, Brexit, Cambridge Analytica... In the last few years, we have had to confront the consequences of the use and misuse of data science algorithms in manipulating public opinion through social media. The use of private data to microtarget individuals is a daily practice (and a trillion-dollar industry), which has serious side-effects when the selling product is your political ideology. How can we cope with this new scenario?
Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...Dataconomy Media
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...Dataconomy Media
The document discusses data innovation and Men on the Moon's approach. It notes that while there is a large amount of available data worldwide, only a small portion is used to create value. Most data science projects also fail. The document then outlines Men on the Moon's "Data Thinking" approach, which combines design thinking and data science. Their approach involves defining a data vision, identifying use cases, prototyping solutions, and enabling employees. The goal is to leverage data to create valuable solutions for people through data innovation.
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...Dataconomy Media
What does it take to build a good data product or service? Data practitioners always think about the technology, user experience and commercial viability. But rarely do they think about the implications of the systems they build. This talk will shed light on the impact of AI systems and the unintended consequences of the use of data in different products. It will also discuss our role, as data practitioners, in planting the seeds of fairness in the systems we build.
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...Dataconomy Media
People analytics applies data science techniques such as machine learning and pattern recognition to employee data to generate insights and reports that help businesses make smarter talent and operational decisions. These decisions can improve workforce effectiveness, engagement, recruitment, retention, and performance, while also increasing sales and reducing fraud and accidents. People analytics technologies include surveys, correlation analysis, machine learning, and AI, which can help companies improve their culture, develop employee skills, and boost growth when the results are properly implemented.
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...Dataconomy Media
Cloud infrastructure is a hostile environment: a power supply failure or a network outage leads to downtime and big losses. There is nothing we can trust: a single server, a server rack, even a whole datacenter can fail, and if an application is fragile by design, disruption is inevitable. We must distribute our application and diversify our cloud data strategy to survive disturbances of any scale. Apache Cassandra is a cloud-native, platform-agnostic database that stores data with distributed redundancy, so it easily survives any such issue. Want to know how Apple and Netflix handle petabytes of data while keeping it highly available? Join us and listen to a story of 10 little servers and no downtime!
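The distributed redundancy the talk describes can be sketched with the DataStax Python driver (pip install cassandra-driver); the contact point, keyspace, and datacenter names below are hypothetical, not the speaker's configuration.

```python
# Minimal sketch: a keyspace whose rows are replicated three times per
# datacenter, so losing a server, a rack, or a whole DC leaves copies.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # one contact point; the driver discovers the rest
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
    }
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        id uuid PRIMARY KEY,
        payload text
    )
""")
cluster.shutdown()
```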
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...Dataconomy Media
In the data industry, having correctly labelled datasets is vital. Timothy Thatcher explains how tagging data at scale can be handled while accounting for time, location, and complex hierarchical rules.
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...Dataconomy Media
This document discusses using machine learning to analyze individual and interpersonal behavior for clinical diagnosis and screening. It focuses on analyzing non-verbal behaviors like interpersonal synchronization that have been shown to be impaired in conditions like autism spectrum disorder. The document proposes that machine learning could provide an objective, automated tool for diagnosing conditions more quickly by analyzing video recordings of social interactions. This may help address bottlenecks in healthcare systems and allow earlier access to treatment.
Data Natives Berlin v 20.0 | "Serving A/B experimentation platform end-to-end"...Dataconomy Media
This document discusses the end-to-end experimentation platform at GetYourGuide for A/B testing. It outlines the challenges of running experiments such as imbalanced assignments, suspicious metric changes, and non-converging results. It also describes the tools used for planning experiments, monitoring assignments, performing daily checks, and analyzing results. The goal is to validate UX changes, estimate effects on customers, and make more objective decisions through A/B testing while addressing issues that could impact experiment quality.
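The abstract does not include code; as a generic illustration of one such "daily check", here is a hedged two-proportion z-test in Python for comparing conversion rates between variants. The sample numbers are invented, and this is not GetYourGuide's actual tooling.

```python
# Daily A/B check sketch: is variant B's conversion rate significantly
# different from A's? Uses a standard two-proportion z-test.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return p_b - p_a, p_value

# A heavily imbalanced sample ratio in a 50/50 experiment would itself be
# a red flag (the "imbalanced assignments" issue the talk mentions).
lift, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_050)
print(f"lift={lift:.4f}, p={p:.3f}")
```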
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...Dataconomy Media
Cloud infrastructure is a hostile environment: a power supply failure or a network outage leads to downtime and big losses. There is nothing we can trust: a single server, a server rack, even a whole datacenter can fail, and if an application is fragile by design, disruption is inevitable. We must distribute our application and diversify our cloud data strategy to survive disturbances of any scale. Apache Cassandra is a cloud-native, platform-agnostic database that stores data with distributed redundancy, so it easily survives any such issue. Want to know how Apple and Netflix handle petabytes of data while keeping it highly available? Join us and listen to a story of 10 little servers and no downtime!
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...Dataconomy Media
Creativity is the mental ability to create new ideas and designs. Innovation, on the other hand, means developing useful solutions from new ideas. Creativity can be goal-oriented, whereas innovation is always goal-oriented: it aims to achieve defined goals. The use of cloud services and technologies promises enterprise users many benefits in terms of more flexible use of IT resources and faster access to innovative solutions. That is why this talk examines what role cloud computing plays in driving innovation in companies.
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...Dataconomy Media
A presentation of the time-series properties of financial instruments and the possibilities for frequency decomposition and information extraction using the Fourier transform (FT), the short-time Fourier transform (STFT), and wavelets, with an outlook on current research on wavelet neural networks.
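As a rough illustration of the wavelet decomposition the talk covers, here is a sketch using the PyWavelets library (pip install PyWavelets); the "price" series below is synthetic, not real market data.

```python
# Multilevel discrete wavelet transform: split a signal into a slow
# approximation (trend) and detail bands of increasing frequency.
import numpy as np
import pywt

t = np.linspace(0, 1, 512)
# Slow cycle plus a fast oscillation, standing in for a price series.
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)

coeffs = pywt.wavedec(signal, "db4", level=4)
approx, details = coeffs[0], coeffs[1:]
print("trend coefficients:", approx.shape)
for i, d in enumerate(details, start=1):
    print(f"detail band {i}: {d.shape}")

# Crude denoising: zero the finest detail band and reconstruct.
coeffs[-1] = np.zeros_like(coeffs[-1])
smoothed = pywt.waverec(coeffs, "db4")
```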
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Dataconomy Media
"With most machine learning (ML) and deep learning (DL) frameworks, it can take hours to move data for ETL, and hours to train models. It's also hard to scale, with data sets increasingly being larger than the capacity of any single server. The amount of the data also makes it hard to incrementally test and retrain models in near real-time.
Learn how Apache Ignite and GridGain help address limitations like ETL costs, scaling issues, and time-to-market for new models, and help achieve near-real-time, continuous learning.
Yuriy Babak, the head of ML/DL framework development at GridGain and Apache Ignite committer, will explain how ML/DL work with Apache Ignite, and how to get started.
Topics include:
— Overview of distributed ML/DL including architecture, implementation, usage patterns, pros and cons
— Overview of Apache Ignite ML/DL, including built-in ML/DL algorithms, and how to implement your own
— Model inference with Apache Ignite, including how to train models with other libraries, like Apache Spark, and deploy them in Ignite
— How Apache Ignite and TensorFlow can be used together to build distributed DL model training and inference"
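Ignite's ML/DL APIs live on the JVM side; as a small getting-started taste from Python, this hedged sketch uses the pyignite thin client (pip install pyignite) to put training rows into a cache that cluster-side jobs could consume. The host, port, and cache name are hypothetical.

```python
# Thin-client sketch: store feature vectors in an Ignite cache so that
# in-cluster (JVM) training jobs can read them without a separate ETL hop.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)  # default thin-client port

cache = client.get_or_create_cache("training_data")
cache.put(1, [5.1, 3.5, 1.4, 0.2])  # feature vector keyed by row id
cache.put(2, [4.9, 3.0, 1.4, 0.2])
print(cache.get(1))

client.close()
```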
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Dataconomy Media
"Machine learning algorithms require significant amounts of training data which has been centralized on one machine or in a datacenter so far. For numerous applications, such need of collecting data can be extremely privacy-invasive. Recent advancements in AI research approach this issue by a new paradigm of training AI models, i.e., Federated Learning.
In federated learning, edge devices (phones, computers, cars etc.) collaboratively learn a shared AI model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. From personal data perspective, this paradigm enables a way of training a model on the device without directly inspecting users’ data on a server. This talk will pinpoint several examples of AI applications benefiting from federated learning and the likely future of privacy-aware systems."
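Federated averaging, the core of this paradigm, can be sketched in a few lines of numpy. This toy linear-regression example is illustrative only, not the speaker's code: each "device" computes an update on its local data, and only model weights ever leave the device.

```python
# Toy federated-averaging rounds: raw data stays on each device; the
# server only sees and averages the locally updated weights.
import numpy as np

def local_update(weights, X, y, lr=0.1):
    # One gradient step of linear regression on this device's data.
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

rng = np.random.default_rng(0)
global_weights = np.zeros(3)
devices = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]

for _ in range(10):
    updates = [local_update(global_weights, X, y) for X, y in devices]
    global_weights = np.mean(updates, axis=0)  # federated averaging

print(global_weights)
```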
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI Model Garden powered experiences and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to:
- execute prompts in text and chat
- cover multimodal use cases with image prompts
- fine-tune and distill models to improve knowledge domains
- run function calls with foundation models to optimize them for specific tasks
At the end of the session, developers will understand how to innovate with generative AI and develop apps following current generative AI industry trends.
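A minimal sketch against the Vertex AI Python SDK (pip install google-cloud-aiplatform) follows; the exact module path and model name vary by SDK version, and a GCP project with Vertex AI enabled is assumed. The project id below is hypothetical.

```python
# Text prompt and multi-turn chat against a Gemini model on Vertex AI.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # hypothetical project

model = GenerativeModel("gemini-1.0-pro")

# Plain text prompt.
response = model.generate_content("Suggest three names for a travel app.")
print(response.text)

# Multi-turn chat keeps conversation state in the session object.
chat = model.start_chat()
print(chat.send_message("What is a vector database?").text)
```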
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Marlon Dumas
This webinar discusses the limitations of traditional approaches to business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be assembled to discover high-fidelity digital twins of end-to-end processes from event data.
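One way to assemble such process-mining steps is with the open-source pm4py library (pip install pm4py); this is a generic sketch, not the webinar's own pipeline, and the event-log file name is hypothetical.

```python
# Discover a process model from an event log, then check how faithfully
# it replays the real data: high fitness is a prerequisite for a
# trustworthy digital twin.
import pm4py

log = pm4py.read_xes("purchase_to_pay.xes")
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)

fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
print(fitness)
```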
Enhanced data collection methods can help uncover the true extent of child abuse and neglect. This includes Integrated Data Systems from various sources (e.g., schools, healthcare providers, social services) to identify patterns and potential cases of abuse and neglect.
Codeless Generative AI Pipelines (GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
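A small sketch of the vector-database end of such a pipeline, using pymilvus (pip install pymilvus); the URI, collection name, and toy embeddings are hypothetical, and in a real deployment NiFi would compute and deliver the embeddings.

```python
# Store and search toy embeddings in Milvus via the pymilvus client.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="docs", dimension=4)

client.insert(
    collection_name="docs",
    data=[
        {"id": 1, "vector": [0.1, 0.2, 0.3, 0.4], "text": "streaming with NiFi"},
        {"id": 2, "vector": [0.9, 0.1, 0.0, 0.2], "text": "Milvus stores vectors"},
    ],
)

# Retrieve the nearest neighbour for a query embedding.
hits = client.search(collection_name="docs", data=[[0.1, 0.2, 0.3, 0.4]], limit=1)
print(hits)
```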
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
"Integration of Hadoop in Business landscape", Michal Alexa, IT and Innovation Cloud Solution Center Lead at DataVard
1. # 1
Integration of Hadoop in Business landscape
Michal Alexa
Service Line Manager
Data Innovation Lab
December 2016
2. # 2
What happens on the Internet in 60 seconds (2014):
• 3,472 images pinned
• 72 hours of new video content uploaded
• 204,000,000 emails sent
• 4,000,000 search queries
• 277,000 tweets
• 347,222 photos sent
• Users swipe 416,667 times
• 2,460,000 new items of content shared
• 216,000 photos shared
• $83,000 in online sales
• 48,000 apps downloaded from the iTunes store
• 26,380 new reviews
5. # 5
Big-Data and Business world
Big-Data:
• Java, Python, Pig Latin
• Massive clusters for big-data processing
• Structured & unstructured data
• Apache & open source
• Distributions (e.g. Cloudera)
• Engines (Spark, Impala)
• Fast-paced evolution since 2006
Business:
• ABAP
• Client/Server
• Classic RDBMS as relational database
• Proprietary software with interfaces
• Engines: OLTP, OLAP
• World positioning: 76% of finance transactions, 78% of food production, 82% of medical devices
• Steady evolution since 1972
8. # 8
Biggest struggles in Data Management
Scalability:
• Lifetime sizing of a platform during procurement is no longer possible
• Hardware requirements limit possible growth
• Scaling up often comes at great cost, and scaling down is usually valueless
Data-Pipelines:
• Data transformations are I/O-intensive operations
• They take a lot of time and consume a lot of resources
• Limitations on the format of data
Granularity and Velocity:
• Limitations on the granularity of data; often only aggregated and cleaned data are stored
• Raw data are necessary for data science activities
Data-Silos:
• Too many places for storing data
• No interconnection between company units limits the possibilities for analyzing data
Extensibility:
• Data analysis requires many programming languages
• Limited application compatibility
9. # 9
What is Apache Hadoop?
A software framework for storing, processing, and analyzing "big data":
• Scalable
• Distributed
• Fault-tolerant
• Open source
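The slide stops at the definition; as a classic first Hadoop program, here is a hedged word-count sketch using Hadoop Streaming, which lets the deck's own Python run as mapper and reducer (input/output paths and the streaming jar location are supplied on the hadoop command line and are not shown here).

```python
# mapper.py: reads raw text on stdin and emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
# reducer.py: input arrives sorted by key, so all counts for one word
# are adjacent and can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

The same pair can be tested locally without a cluster: `cat input.txt | python mapper.py | sort | python reducer.py`.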
11. # 11
"Data-Lake" in Business infrastructure
[Diagram: a Data-Lake placed alongside BW, the source systems, and their logs]
12. # 12
"Data-Lake" in Business infrastructure
[Diagram: the same landscape with a second BW attached to the Data-Lake]
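To make the diagrams concrete, here is a hedged PySpark sketch of landing raw logs in the lake at full granularity, the point the earlier slide made about raw data for data science. The HDFS paths are hypothetical.

```python
# Land raw, unaggregated log records in the data lake, partitioned by
# ingest date, so downstream users are not limited to cleaned aggregates.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

logs = spark.read.json("hdfs:///landing/web_logs/")
(logs
    .withColumn("ingest_date", current_date())
    .write
    .mode("append")
    .partitionBy("ingest_date")
    .parquet("hdfs:///lake/raw/web_logs/"))

spark.stop()
```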
14. # 14
Emerging new technologies – Integration answers to Big Data
Smart Data Access (SDA):
• Data federation feature available on SAP HANA
• Not fully read-write
• Supports Sybase ASE, Sybase IQ, Teradata, Hadoop, and some other databases
Dynamic Tiering (DT):
• Supports only Write-Optimized DSO and PSA
• Some restrictions; Sybase IQ only
• Limited disaster recovery
• Read & write, but only on HANA
Nearline Storage (NLS):
• Moves data from the online database to a "nearline" database
• Read-only
• Uses DAP (Data Archiving Processes)
• SAP positions Sybase IQ as the "one and only" storage
SAP HANA VORA:
• DB interface between HANA and Hadoop (Spark)
• Heavily Java-based – no ABAP workbench integration etc.
• No UI – engine only
• Allows reporting within Hadoop based on Spark
Data Lifecycle Manager (DLM):
• HANA native only, no ERP
• Offloading to IQ or Spark
The slide groups these technologies under two headings: Offloading and Integration.
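Once a DBA has set up Smart Data Access, a Hadoop table appears in HANA as a virtual table and can be queried like any other. A hedged sketch with SAP's hdbcli Python driver (pip install hdbcli) follows; host, port, credentials, and table names are all hypothetical.

```python
# Join in-memory HANA data with Hadoop data exposed through an SDA
# virtual table, from a plain Python client.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-host", port=39015, user="DEMO", password="***")
cur = conn.cursor()

cur.execute("""
    SELECT o.order_id, o.amount, c.clicks
    FROM "SALES"."ORDERS" AS o
    JOIN "VIRT"."HADOOP_CLICKSTREAM" AS c  -- virtual table backed by Hadoop
      ON o.customer_id = c.customer_id
""")
for row in cur.fetchmany(10):
    print(row)
conn.close()
```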
15. # 15
Business <> Hadoop struggle
Hadoop integration with businesses is difficult for several reasons: technology readiness, IT culture, data integration, and operations.
IT culture gap:
• Development strategy
• Software logistics
• Rapid prototyping
• Data protection / personal data
• SOX compliance
Data integration gap (a minimal ETL sketch follows below):
• ETL
• Loading of data
• Staging & enriching of data within Hadoop
• Data flows from SAP to Hadoop and back
Operational gap:
• Running applications 24x7 between SAP and Hadoop
• Job scheduling
• Testing
• Patching & upgrades
We should aim to close these gaps.
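To make the "data integration gap" concrete, here is a hedged PySpark sketch of staging SAP extracts in Hadoop, enriching them with Hadoop-native data, and writing a curated result set back for SAP to pick up. All paths and column names are hypothetical.

```python
# Stage, enrich, deliver: the flow "from SAP to Hadoop and back".
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("sap-enrich").getOrCreate()

# Stage: a raw SAP extract and Hadoop-native clickstream, both in the lake.
orders = spark.read.parquet("hdfs:///lake/staging/sap_orders/")
clicks = spark.read.parquet("hdfs:///lake/raw/web_logs/")

# Enrich: attach per-customer click counts to each order.
click_counts = clicks.groupBy("customer_id").count().withColumnRenamed("count", "clicks")
enriched = orders.join(click_counts, on="customer_id", how="left")

# Deliver: a curated table that a scheduled job can load back into SAP.
enriched.filter(col("clicks") > 0).write.mode("overwrite").parquet(
    "hdfs:///lake/curated/orders_with_clicks/"
)
spark.stop()
```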
16. # 16
Summary
• Hadoop is awesome! Let's make it truly available to all businesses.
• Start small: a small amount of data and a fast turnover.
• Think about how to enable the new technology for others.