This document provides an overview of big data concepts and technologies for managers. It discusses problems with relational databases for large, unstructured data and introduces NoSQL databases and Hadoop as solutions. It also summarizes common big data applications, frameworks like MapReduce, Spark, and Flink, and different NoSQL database categories including key-value, column-family, document, and graph stores.
My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.
Moving analytic workloads into production - specific technical challenges and best practices for engineering SQL in Hadoop solutions. Highlighting the next generation engineering approaches to the secret sauce we have implemented in the Actian VectorH database.
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
Geisinger Health System is well known in the healthcare community as a pioneer in data and analytics. We have had an Electronic Health Record (EHR) since 1996, and an Electronic Data Warehouse (EDW) since 2008. Much of daily and weekly operational reporting, as well as an abundance of ad hoc analytics, come from the EDW.
Approximately 18 months ago, the Data Management team implemented Hadoop in the Hortonworks Data Platform (HDP), and successes in implementation and development have proven to the organization that we should abandon the traditional EDW in favor of the Big Data (HDP) platform.
In less than 18 months, we stood up the platform, created a data ingestion pipeline, duplicated all source feeds from the EDW into HDP, and had several analytics developed with HDP and Tableau. Furthermore, we have exploited the new capabilities of the platform, where we use Natural Language Processing (NLP) to interrogate valuable (but previously hidden) clinical notes. The new platform has data that is modeled and governed, setting the stage to push Geisinger Health System from a pioneer to a leader in Big Data and Analytics.
This session will focus on Hortonworks Data Platform, covering data architecture, security, data process flow, and development. It is geared toward Data Architects, Data Scientists, and Operations/I.T. audiences.
Traditional data storage and analytic tools no longer provide the agility and flexibility required to deliver relevant business insights. That’s why organizations are shifting to a data lake architecture. This approach allows you to store massive amounts of data in a central location so it's readily available to be categorized, processed, analyzed, and consumed by diverse organizational groups. In this session, we’ll assemble a data lake using services such as Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue.
Hadoop based data Lakes have become increasingly popular within today’s modern data architectures for their ability to scale, handle data variety and low cost. Many organizations start slow with the data lake initiatives but as they grow bigger, they suffer with challenges on data consistency, quality and security, resulting in losing confidence in their data lake initiatives.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes and it relationship with productivity and how it helps organizations meet regulatory and compliance requirements. The talk advocates carrying a different mindset for designing and implementing flexible governance mechanisms on Hadoop data lakes.
In this session we take an in-depth look into the Apache Atlas open metadata and governance function.
Open metadata and governance is a moon-shot type of project to create a set of open APIs, types, and interchange protocols to allow all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks to automate the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
Apache Atlas is the reference implementation of the Open Metadata and Governance standards and framework (https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance). This function will enable an Apache Atlas server to synchronize and query metadata from any open metadata-compliant metadata repository.
In this session we will cover how Open Metadata and Governance works. This includes: (1) the key components in Atlas, (2) the different integration patterns and APIs that vendors can use to integrate their technology into the open metadata ecosystem, and (3) how common metadata use cases such as searching for data sets, managing security (through Atlas/Ranger integration), and automated metadata discovery work in the active ecosystem.
Speaker
Mandy Chessell, Distinguished Engineer, IBM
My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.
Moving analytic workloads into production - specific technical challenges and best practices for engineering SQL in Hadoop solutions. Highlighting the next generation engineering approaches to the secret sauce we have implemented in the Actian VectorH database.
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
Geisinger Health System is well known in the healthcare community as a pioneer in data and analytics. We have had an Electronic Health Record (EHR) since 1996, and an Electronic Data Warehouse (EDW) since 2008. Much of daily and weekly operational reporting, as well as an abundance of ad hoc analytics, come from the EDW.
Approximately 18 months ago, the Data Management team implemented Hadoop in the Hortonworks Data Platform (HDP), and successes in implementation and development have proven to the organization that we should abandon the traditional EDW in favor of the Big Data (HDP) platform.
In less than 18 months, we stood up the platform, created a data ingestion pipeline, duplicated all source feeds from the EDW into HDP, and had several analytics developed with HDP and Tableau. Furthermore, we have exploited the new capabilities of the platform, where we use Natural Language Processing (NLP) to interrogate valuable (but previously hidden) clinical notes. The new platform has data that is modeled and governed, setting the stage to push Geisinger Health System from a pioneer to a leader in Big Data and Analytics.
This session will focus on Hortonworks Data Platform, covering data architecture, security, data process flow, and development. It is geared toward Data Architects, Data Scientists, and Operations/I.T. audiences.
Traditional data storage and analytic tools no longer provide the agility and flexibility required to deliver relevant business insights. That’s why organizations are shifting to a data lake architecture. This approach allows you to store massive amounts of data in a central location so it's readily available to be categorized, processed, analyzed, and consumed by diverse organizational groups. In this session, we’ll assemble a data lake using services such as Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue.
Hadoop based data Lakes have become increasingly popular within today’s modern data architectures for their ability to scale, handle data variety and low cost. Many organizations start slow with the data lake initiatives but as they grow bigger, they suffer with challenges on data consistency, quality and security, resulting in losing confidence in their data lake initiatives.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes and it relationship with productivity and how it helps organizations meet regulatory and compliance requirements. The talk advocates carrying a different mindset for designing and implementing flexible governance mechanisms on Hadoop data lakes.
In this session we take an in-depth look into the Apache Atlas open metadata and governance function.
Open metadata and governance is a moon-shot type of project to create a set of open APIs, types, and interchange protocols to allow all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks to automate the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
Apache Atlas is the reference implementation of the Open Metadata and Governance standards and framework (https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance). This function will enable an Apache Atlas server to synchronize and query metadata from any open metadata-compliant metadata repository.
In this session we will cover how Open Metadata and Governance works. This includes: (1) the key components in Atlas, (2) the different integration patterns and APIs that vendors can use to integrate their technology into the open metadata ecosystem, and (3) how common metadata use cases such as searching for data sets, managing security (through Atlas/Ranger integration), and automated metadata discovery work in the active ecosystem.
Speaker
Mandy Chessell, Distinguished Engineer, IBM
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old fashioned Hive as a tool for easily and efficiently converting exiting datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...DataWorks Summit
For firms in the financial industry, especially within regulated organizations such as credit card processors and banks, PCI DSS compliance has become a business and operational necessity. Although the blueprint of a PCI-compliant architecture varies from organization to organization, the mixture of modern Hadoop-based data lakes and legacy systems are a common theme.
In this talk, we will discuss recent updates to PCI DSS and how significant portions of PCI DSS compliance controls can be achieved using open source Hadoop security stack and technologies for the Hadoop ecosystem. We will provide a broad overview of implementing key aspects of PCI DSS standards at WorldPay such as encryption management, data protection with anonymization, separation of duties, and deployment considerations regarding securing the Hadoop clusters at the network layer from a practitioner’s perspective. The talk will provide patterns and practices map current Hadoop security capabilities to security controls that a PCI-compliant environment requires.
Speaker
David Walker, Enterprise Data Platform Programme Director, Worldpay
Srikanth Venkat, Senior Director Product Management, Hortonworks
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...DataWorks Summit
Finance Data Lake objective is to create a centralized enterprise data repository for all Finance and Supply Chain data. It serves as the single source of truth. It enables a self-service discovery Analytics platform for business users to answer adhoc business questions and derive critical insights. The data lake is based on open source Hadoop big data platform and a very cost effective solution in breaking the ERP data silos and simplifying the data architecture in the enterprise.
POCs were conducted on in-house Hortonworks Hadoop data platform to validate the cluster performance for Production volumes. Based on business priorities, an initial roadmap was defined using 3 data sources including 2 SAP ERPs and Peoplesoft (OLTP systems). Development environment was established in AWS Cloud for agile delivery. The near real time data ingestion architecture for the data lake was defined using replication tools and custom SQOOP based micro-batching framework and data persisted in Apache Hive DB in ORC format. Data and user security is implemented using Apache Ranger and sensitive data stored at rest in encryption zones. Business data sets were developed in Hive scripts and scheduled using Oozie. Multiple reporting tools connectivity including SQL tools, Excel and Tableau were enabled for Self-service Analytics. Upon successful implementation of the initial phase, a full roadmap is established to extend the Finance data lake to over 25 data sources and enhance data ingestion to scale as well as enable OLAP tools on Hadoop.
Strata San Jose 2017 - Ben Sharma PresentationZaloni
Learn about the promise of data lakes:
- Store all types of data in its raw format
- Create refined, standardized, trusted datasets for various use cases
- Store data for longer periods of time to enable historical analysis - Query and Access the data using a variety of methods
- Manage streaming and batch data in a converged platform
- Provide shorter time-to-insight with proper data management and governance
Prior to 2014, Walgreens has traditional Enterprise Datawarehouse Systems that have reached the capacity limits. Over the last three years we have evolved, learned lessons, experienced successes and failures. Our initial adoption of Hadoop came from the need to run complex analytics which simply did not scale on MPP RDBMS. Our business data demands were rapidly increasing and the 8 to 12 weeks concomitant extract, transform, and load turn around cycles was not a acceptable deliverable timeframe in the retail space. A self service model where data lands on a distributed platform, apply schema where necessary, and process at scale was a necessary paradigm for business value enablement. Our journey started with single use case which has now evolved to enterprise data hub. We will discuss following points: Evolution of our infrastructure profile, streamlining the hardware provisioning cycle, and our hybrid deployment model (on premise & cloud). Operations, how SmartSense has helped us proactively tune our cluster, and which operational tests we use for benchmarking the cluster. Monitoring, how we monitor and the tools required for enterprise grade monitoring. Security and governance how we progressed from non–compliance to enterprise grade using Ranger, Knox, Kerberos, HP voltage, encryption at rest, and many other services. 3rd Party integration with HDP, what we learned and how we overcame the challenges. Lastly, how we approach our disaster recovery strategy, what is driving the need for a DR and the key capabilities required.
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...DataWorks Summit
The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...NoSQLmatters
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and creation and maintenance of Hive databases and tables becomes much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old fashioned Hive as a tool for easily and efficiently converting exiting datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...DataWorks Summit
For firms in the financial industry, especially within regulated organizations such as credit card processors and banks, PCI DSS compliance has become a business and operational necessity. Although the blueprint of a PCI-compliant architecture varies from organization to organization, the mixture of modern Hadoop-based data lakes and legacy systems are a common theme.
In this talk, we will discuss recent updates to PCI DSS and how significant portions of PCI DSS compliance controls can be achieved using open source Hadoop security stack and technologies for the Hadoop ecosystem. We will provide a broad overview of implementing key aspects of PCI DSS standards at WorldPay such as encryption management, data protection with anonymization, separation of duties, and deployment considerations regarding securing the Hadoop clusters at the network layer from a practitioner’s perspective. The talk will provide patterns and practices map current Hadoop security capabilities to security controls that a PCI-compliant environment requires.
Speaker
David Walker, Enterprise Data Platform Programme Director, Worldpay
Srikanth Venkat, Senior Director Product Management, Hortonworks
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...DataWorks Summit
Finance Data Lake objective is to create a centralized enterprise data repository for all Finance and Supply Chain data. It serves as the single source of truth. It enables a self-service discovery Analytics platform for business users to answer adhoc business questions and derive critical insights. The data lake is based on open source Hadoop big data platform and a very cost effective solution in breaking the ERP data silos and simplifying the data architecture in the enterprise.
POCs were conducted on in-house Hortonworks Hadoop data platform to validate the cluster performance for Production volumes. Based on business priorities, an initial roadmap was defined using 3 data sources including 2 SAP ERPs and Peoplesoft (OLTP systems). Development environment was established in AWS Cloud for agile delivery. The near real time data ingestion architecture for the data lake was defined using replication tools and custom SQOOP based micro-batching framework and data persisted in Apache Hive DB in ORC format. Data and user security is implemented using Apache Ranger and sensitive data stored at rest in encryption zones. Business data sets were developed in Hive scripts and scheduled using Oozie. Multiple reporting tools connectivity including SQL tools, Excel and Tableau were enabled for Self-service Analytics. Upon successful implementation of the initial phase, a full roadmap is established to extend the Finance data lake to over 25 data sources and enhance data ingestion to scale as well as enable OLAP tools on Hadoop.
Strata San Jose 2017 - Ben Sharma PresentationZaloni
Learn about the promise of data lakes:
- Store all types of data in its raw format
- Create refined, standardized, trusted datasets for various use cases
- Store data for longer periods of time to enable historical analysis - Query and Access the data using a variety of methods
- Manage streaming and batch data in a converged platform
- Provide shorter time-to-insight with proper data management and governance
Prior to 2014, Walgreens has traditional Enterprise Datawarehouse Systems that have reached the capacity limits. Over the last three years we have evolved, learned lessons, experienced successes and failures. Our initial adoption of Hadoop came from the need to run complex analytics which simply did not scale on MPP RDBMS. Our business data demands were rapidly increasing and the 8 to 12 weeks concomitant extract, transform, and load turn around cycles was not a acceptable deliverable timeframe in the retail space. A self service model where data lands on a distributed platform, apply schema where necessary, and process at scale was a necessary paradigm for business value enablement. Our journey started with single use case which has now evolved to enterprise data hub. We will discuss following points: Evolution of our infrastructure profile, streamlining the hardware provisioning cycle, and our hybrid deployment model (on premise & cloud). Operations, how SmartSense has helped us proactively tune our cluster, and which operational tests we use for benchmarking the cluster. Monitoring, how we monitor and the tools required for enterprise grade monitoring. Security and governance how we progressed from non–compliance to enterprise grade using Ranger, Knox, Kerberos, HP voltage, encryption at rest, and many other services. 3rd Party integration with HDP, what we learned and how we overcame the challenges. Lastly, how we approach our disaster recovery strategy, what is driving the need for a DR and the key capabilities required.
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...DataWorks Summit
The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...NoSQLmatters
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
Introduction to Deep Learning and AI at Scale for ManagersDataWorks Summit
Deep Learning and the new wave of AI are inevitably coming to your business area. If you are a manager and if you are trying to make sense of all the buzzwords, this session is four you. We will show you what is Deep Learning in a way that you will understand how it works and how can you apply it. We then expand the scope and apply the deep learning and AI techniques in the Big Data context. You will learn about things that don't work out so well, the risks and challenges in both applying and developing with deep learning and AI technologies. We conclude with practical guidance on how to add the exciting deep learning and AI capabilities to your next project.
Outline:
- The path to Deep Learning
- From machine learning to Deep Learning
- But how does it work?
- Deep Learning architectures
- Deep Learning applications
- Deep Learning at scale
- Running AI at scale
- Deep learning at Scale using Spark
- The trouble with AI
- Application challenges
- Development challenges
- How to start your first Deep Learning project
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
This talk focus is on what a data reservoir is, how it related to the RDBMS DW, and how Big Data Discovery provides access to it to business and BI users
Large corporations have to master vast amounts of heterogeneous data in order to stay competitive. While existing approaches have attempted to consolidate and manage the data by forcing it into a single shared data model, data lakes recently emerged that instead provide a central storage point for holding all data sets in their original form. In this talk, we present eccenca Corporate Memory, which extends the data lake paradigm with a semantic integration layer for managing diverse, but semantically enriched data. In addition to that, we depict our vision for public / private data co-evolution and how we research this topic in the joint project Linked Enterprise Data Services (LEDS) together with the University of Leipzig and other partners.
from René Pietzsch | Head of Product Management, Eccenca
and Dr. Michael Martin | AKSW, Universität Leipzig, LEDS Project
Presentation at Sachsentag der Angewandten informatik 2016 in Leipzig in the context with the results of the LEDS project
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight to AWS data stores, flat files, or other third-party data sources and begin visualizing your data in minutes. We also introduce SPICE - a new Super-fast, Parallel, In-memory, Calculation Engine in Amazon QuickSight, which performs advanced calculations and render visualizations rapidly without requiring any additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides you with smart visualizations and graphs that are optimized for your different data types, to ensure the most suitable and appropriate visualization to conduct your analysis, and how to share these visualization stories using the built-in collaboration tools.
Presented by: Matthew McClean, AWS Partner Solutions Architect, Amazon Web Services
Datameer 6 is completely re-imagining the user experience for modern BI, helping you deliver new insights faster and more results to your data-hungry business. During this one-hour webinar, we demonstrated all that's new with Datameer 6 and how you can:
Discover answers to a new range of business questions using an iterative, exploratory approach
Find answers faster and deliver more insights with a new faster analytic workflow
Utilize Spark to speed analytic processing time without needing to know technical details
Watch this on-demand webinar, with special guest speaker, Sean Anderson, Senior Product Marketing Manager, from Cloudera, who discusses Cloudera's view of the Hadoop data processing stack and how the market place is benefiting from Spark.
This is the deck that I used for the talks that I gave in the Silicon Valley / San Francisco bay area at various events in April and May 2016.
1. Introduces Big Data and related challenges.
2. Briefly covers some of the important open-source big data related technologies.
3. Introduces Hadoop
4. Introduces Spark Core, Spark SQL, MLlib and GraphX
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachDATAVERSITY
After three decades of relational data modeling, everyone’s pretty comfortable with schemas, tables, and entity-relationships. As more and more Global 2000 companies choose NoSQL databases to power their Digital Economy applications, they need to think about how to best model their data. How do they move from a constrained, table-driven model to an agile, flexible data model based on JSON documents?
This webinar is intended for architects and application developers who want to learn about new JSON document data modeling approaches, techniques, and best practices. This webinar will show you how to get started building a JSON document data model, how to migrate a table-based data model to JSON documents, and how to optimize your design to enable fast query performance.
This webinar will provide practical, experience-based advice and best practices for modeling JSON documents, including:
- When to embed or not embed objects in your JSON document
- Data modeling using a practical data access pattern approach
- Indexing your JSON documents
- Querying your data using N1QL (SQL for JSON)
Horses for Courses: Database RoundtableEric Kavanagh
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
View the companion webinar at: http://embt.co/1L8V6dI
Some claim that, in the age of Big Data, data modeling is less important or even not needed. However, with the increased complexity of the data landscape, it is actually more important to incorporate data modeling in order to understand the nature of the data and how they are interrelated. In order to do this effectively, the way that we do data modeling needs to adapt to this complex environment.
One of the key data modeling issues is how to foster collaboration between new groups, such as data scientists, and traditional data management groups. There are often different paradigms, and yet it is critical to have a common understanding of data and semantics between different parts of an organization. In this presentation, Len Silverston will discuss:
+ How Big Data has changed our landscape and affected data modeling
+ How to conduct data modeling in a more ‘agile’ way for Big Data environments
+ How we can collaborate effectively within an organization, even with differing perspectives
About the Presenter:
Len Silverston is a best-selling author, consultant, and a fun and top rated speaker in the field of data modeling, data governance, as well as human behavior in the data management industry, where he has pioneered new approaches to effectively tackle enterprise data management. He has helped many organizations world-wide to integrate their data, systems and even their people. He is well known for his work on "Universal Data Models", which are described in The Data Model Resource Book series (Volumes 1, 2, and 3).
These slides - based on the webinar - shed light on how business stakeholders make the most of information from their big data environments and the requirements those stakeholders have to turn big data into business impact.
Using recent big data end-user research from leading IT analyst firm Enterprise Management (EMA), data from Vertica’s recent benchmarks on SQL on Hadoop, and firsthand customer experiences, viewers will learn:
- Use cases where end users around the world are using big data in their organizations
- How maturity with big data strategies impact why and how business stakeholders use information from their big data environments
- How Vertica empowers the use of information from big data environments
Operational Analytics Using Spark and NoSQL Data StoresDATAVERSITY
NoSQL data stores have emerged for scalable capture and real-time analysis of data. Apache Spark and Hadoop provide additional scalable analytics processing. This session looks at these technologies and how they can be used to support operational analytics to improve operational effectiveness. It also looks at an example of how operational analytics can be implemented in NoSQL environments using the Basho Data Platform with Apache Spark:
•The emergence of NoSQL, Hadoop and Apache Spark
•NoSQL Use Cases
•The need for operational analytics
•Types of operational analysis
•Key requirements for operational analytics
•Operational analytics using the Basho Data Platform with Apache Spark.
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
Databricks' founders caused a seismic shift in data analysis community when they created Apache Spark which has become a cornerstone of Big Data processing pipelines and tools in large and small companies all around the world. Now they've built a revolutionary, comprehensive and easy-to-use platform around Apache Spark and their other inventions, such as MLFlow and Koalas frameworks and most importantly the Data Lakehouse: a concept of fusing data warehouse and data lake architectures into a single versatile and fast platform. Technical foundation for Databricks Data Lakehouse is Delta Lake. More than 7000 organizations today rely on Databricks to enable massive-scale data engineering, collaborative data science, full-lifecycle machine learning and business analytics. Come to the talk and see the demo to find out why.
Many Organizations are currently processing various types of data and in different formats. Most often this data will be in free form, As the consumers of this data growing it’s imperative that this free-flowing data needs to adhere to a schema. It will help data consumers to have an expectation of about the type of data they are getting and also they will be able to avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the Data Pipeline a really easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing, evolving schemas. It provides an API & tooling to help developers and users to register a schema and consume that schema without having any impact if the schema changed. Users can tag different schemas and versions, register for notifications of schema changes with versions etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just a hype - it outperforms state-of-the-art ML algorithms. One by one. In this talk we will show how DeepLearning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different BigData engines like ApacheSpark and ApacheFlink. Key in this talk is the absence of any large training corpus since we are using unsupervised machine learning - a domain current DL research threats step-motherly. As we can see in this demo LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. Once draw back of DeepLearning is that normally a very large labaled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with DeepLearning - no labeled data set is necessary. We are able to detect anomalies and predict braking bearings with 10 fold confidence. All examples and all code will be made publicly available and open sources. Only open source components are used.
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means, that QE Automations scenarios need to be detailed around actual use cases, cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, verifying which is a tedious job. Given the multiple levels of indirection, the false positives of actual defects are higher, and are generally wasteful.
At Hortonworks, we’ve designed and implemented Automated Log Analysis System - Mool, using Statistical Data Science and ML. Currently the work in progress has a batch data pipeline with a following ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures, by correlating the failing test cases, with current and historical error records, to identify root cause of errors across multiple components. The system works in unsupervised mode with no perfect model/stable builds/source-code version to refer to. In addition the system provides limited recommendations to file/open past tickets and compares run-profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Us the platform appropriately (Not every problem is eligible for Big Data and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotiy's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HSFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, discussing some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open-source and supports a pluggable database backend for distributed metadata, although it currently only support MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted assuming data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like: "What happens if the entire datacenter fails?, or "How do I recover into a consistent state of data, so that applications can continue to run?" are not a all trivial to answer for Hadoop. Did you know that HDFS snapshots are handling open files not as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned are not (yet) important. This talk first is introducing you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee a continuous operation of Hadoop cluster based solutions.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.