BigDataInOperationsV8

This document discusses Synchronoss' journey in developing their data pipeline and profiling capabilities. It describes: 1) Their initial ETL-based pipeline (V1), which had long batch processes and could not handle large, unstructured data. 2) An upgraded version (V2) using an MPP appliance, which improved performance but had high costs. 3) Their adoption of Spark (V4) to build a flexible, scalable pipeline that profiles data in the data lake using RDDs and built-in transformations. 4) How this approach cut their data analysis time from weeks to hours and identified data quality issues earlier.
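
The summary does not show what the profiling step computes. As a rough illustration only, the sketch below profiles a handful of records in plain Python, counting rows, nulls, and distinct values per column. The function name `profile`, the sample fields, and the choice of metrics are assumptions made for this example; they are not taken from the deck, and a Spark job would express the same aggregation as transformations over an RDD rather than a local loop.

```python
from collections import defaultdict


def profile(records):
    """Return per-column counts of rows, nulls, and distinct non-null values."""
    stats = defaultdict(lambda: {"rows": 0, "nulls": 0, "distinct": set()})
    for record in records:
        for column, value in record.items():
            col = stats[column]
            col["rows"] += 1
            # Treat None and the empty string as missing (an assumption).
            if value is None or value == "":
                col["nulls"] += 1
            else:
                col["distinct"].add(value)
    # Replace each set of distinct values with its size for reporting.
    return {
        column: {"rows": s["rows"], "nulls": s["nulls"], "distinct": len(s["distinct"])}
        for column, s in stats.items()
    }


# Hypothetical sample rows, invented for illustration.
sample = [
    {"device_id": "a1", "status": "ok"},
    {"device_id": "a2", "status": None},
    {"device_id": "a1", "status": "ok"},
]
report = profile(sample)
print(report)
```

A report like this surfaces columns with many nulls or unexpectedly few distinct values before any downstream analysis runs, which is the kind of early data quality signal the summary refers to.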

Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli

In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.

Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...

confluent

At Under Armour Connected Fitness, we’ve built an event streaming platform on top of Kafka and the Confluent stack that makes it easy for developers to produce and consume schema-based events without requiring direct knowledge of Kafka. We are constantly trying to improve the developer experience. The platform consists of multiple federated Kafka clusters, a schema registry, a topology service, an archiver, and specialized client libraries and Web / CLI tools that assist developers with producer and consumer workflows. In this talk, we will take a deeper dive into the design and implementation of a Scala/Java implementation of our client library that allows developers to produce or consume events without worrying about the underlying infrastructure and its location while enjoying the benefits of data compatibility through schemas. We’ll also look at an HTTP-based client proxy that exposes the same API for languages without our native support. Finally, we’ll walk through the Web and CLI tools we built to make working with the platform easier. The content of this talk will be primarily aimed at software developers looking for ideas on how to build Kafka client tools that allow producer/consumer interactions protected by schema-based event definitions while hiding details of the underlying infrastructure.

Introduction to basic data analytics tools

Nascenia IT

This document introduces basic data analytics tools. It discusses the data analytics pipeline of collecting, refining, storing, analyzing, and presenting data. It describes tools for each step including Requests and BeautifulSoup for data acquisition, Pandas and SQLAlchemy for data processing and storage, R and RStudio for data analysis, and Plotly and Matplotlib for data visualization. Apache Superset is highlighted as a tool for data visualization and exploration. Challenges of data analytics like data quality, privacy, and scaling are also outlined.

Dealing With Drift - Building an Enterprise Data Lake

Pat Patterson

Data drift, the gradual morphing of data structure and semantics, is a fact of life in enterprise IT. New requirements force schema changes, the meaning of database columns changes over time, and infrastructure upgrades add new fields to log files. Left unchecked, drift in data sources can cause applications and dataflows to fail, with costly downtime and, in the worst case, corruption in downstream data stores. Cox Automotive comprises more than 25 companies dealing with different aspects of the car ownership lifecycle, with data as the common language they all share. The challenge for Cox was to create an efficient, timely, and trustworthy data ingest capability for an unknown but large number of data assets from practically any source. Discover how their big data engineering team overcame data drift and is now populating a data lake, allowing analysts easy access to data from their subsidiary companies and producing new data assets unique to the industry.
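
The abstract names three kinds of drift: new fields, fields that disappear, and fields whose meaning or type changes. A minimal sketch of how an ingest step might flag the structural ones is shown below. The function `detect_drift`, the expected schema, and the sample record are all hypothetical and are not drawn from the talk; semantic drift (a column whose meaning changes while its type stays the same) is not something a check like this can catch.

```python
def detect_drift(expected, record):
    """Compare one record against an expected schema (field name -> type)."""
    added = sorted(set(record) - set(expected))
    missing = sorted(set(expected) - set(record))
    # A field counts as retyped when it is present, non-null, and of the wrong type.
    retyped = sorted(
        name
        for name in set(expected) & set(record)
        if record[name] is not None and not isinstance(record[name], expected[name])
    )
    return {"added": added, "missing": missing, "retyped": retyped}


# Hypothetical schema and incoming record, invented for illustration.
expected_schema = {"vin": str, "price": float, "dealer_id": int}
incoming = {"vin": "1HGCM82633A004352", "price": "18500", "color": "blue"}

drift = detect_drift(expected_schema, incoming)
print(drift)
```

Reporting drift rather than failing on it lets the pipeline keep ingesting while someone decides whether the new shape should become the expected one.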

The Stream is the Database - Revolutionizing Healthcare Data Architecture

This document discusses using event streams as the system of record for data, rather than traditional databases. It argues that streams can serve as the single source of truth for data, providing benefits like data lineage, auditing, and integrity. It also describes how healthcare company Liaison uses a streaming platform from MapR to power their data integration platform, gaining the advantages of streams while meeting various compliance requirements.
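
To make the idea of a stream as the system of record concrete, here is a small in-memory sketch: state is never stored directly, only derived by replaying an append-only log, which is what makes lineage and point-in-time audits possible. The classes `Event` and `EventLog` and the sample keys are assumptions made for this example; they do not represent the MapR platform or Liaison's implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    offset: int   # position in the log, assigned on append
    key: str      # entity the event belongs to
    value: dict   # fields set or changed by this event


@dataclass
class EventLog:
    events: list = field(default_factory=list)

    def append(self, key, value):
        """Add an event to the end of the log; existing events are never modified."""
        event = Event(offset=len(self.events), key=key, value=dict(value))
        self.events.append(event)
        return event

    def replay(self, up_to=None):
        """Rebuild the state per key from events with offset <= up_to (all if None)."""
        state = {}
        for event in self.events:
            if up_to is not None and event.offset > up_to:
                break
            state.setdefault(event.key, {}).update(event.value)
        return state


# Hypothetical events, invented for illustration.
log = EventLog()
log.append("patient-1", {"name": "A. Example"})
log.append("patient-1", {"allergy": "penicillin"})
log.append("patient-2", {"name": "B. Example"})

current = log.replay()           # latest view of every key
as_of_first = log.replay(up_to=0)  # view as it stood after the first event
```

Because every view is computed from the log, an auditor can reproduce what the system knew at any offset without a separate history table.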

Spark and the Enterprise by Tony Baer

This document discusses the growing popularity and capabilities of the Apache Spark platform for large-scale data analytics. It notes that Spark has over 40 committers and 1000 contributors and is being used in 179 projects. The document highlights key features of Spark like its ease of use, performance (10-100x faster than MapReduce), flexibility, and ability to handle both batch and real-time processing. It also provides examples of how Spark can help businesses by enabling more complex analytics like predictive modeling, enabling smarter predictions, and allowing insights from real-time data. The document emphasizes that Spark advocates should focus on illustrating tangible business benefits over technical features when discussing Spark with higher-level business stakeholders.

Spark Summit Keynote by Seshu Adunuthula

1) eBay's enterprise data platform uses Apache Spark and Hadoop to process large amounts of structured and unstructured data from various sources to power applications and analytics. 2) Key aspects of the platform include an agile data warehouse, data streams platform using Apache Kafka, and data services to simplify access to data and enable collaborative analytics. 3) eBay leverages this platform to power applications such as search, personalization, fraud prevention, and business intelligence through pipelines that ingest behavioral and transactional data.

Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...

✔ Eric David Benari, PMP

Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platfor...

Databricks

How did eBay move their ETL computation from a conventional RDBMS environment over to Spark? What did it take to go from a strategic vision to a viable solution? This paper will take you through a journey which led to an implementation of a 1000+ node Spark cluster running 10,000+ ETL jobs daily, all done in a span of less than 6 months, by a team with limited Spark experience. We will share the vision, technical architecture, critical management decisions, challenges, and the road ahead. This will be a unique opportunity to look into this awesome Spark success story at eBay!

Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...

HP ships millions of PCs, printers, and other devices every year to customers in all market segments. More customers are seeking services provided with our products, enabling new opportunities for HP to create services from the data we can collect from our devices. Every device we ship is an IoT endpoint with a powerful CPU to capture rich data. Insights from this data are used internally to improve our products and focus on customer needs. In this presentation, John will focus on HP’s journey to enabling Big Data analytics from within a large enterprise environment. He will review the challenges and how HP decided on AWS, Apache Spark and Databricks as the foundation for their entry into Big Data analytics. John will also review how HP uses Spark to build analytic services from the data they generate from their devices.

Winning the On-Demand Economy with Spark and Predictive Analytics

SingleStore

Today’s on-demand economy drives companies to provide fast load times, personalization, and instantaneous service for hungry end-users across all types of applications. Yet most still use dated, legacy systems to process and analyze data. In this session, Ankur Goyal, VP of Engineering at MemSQL, will showcase implementing a one-click Lambda Architecture with Apache Spark, Apache Kafka and an operational database, resulting in lightning-fast analytics on large, changing datasets.

Spark Usage in Enterprise Business Operations

SAP Technology

Predicting Loan Delinquency at One Million Transactions per Second

Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.

Netflix Data Engineering @ Uber Engineering Meetup

Blake Irvine

Presto summit israel 2019-04

Data analytics at a petabyte scale final

Zillow's favorite big data & machine learning tools

njstevens

Middle Tier Scalability - Present and Future

dfilppi

How the growth of R helps data-driven organizations succeed

[Presented to the 7th China R Users Conference, Beijing, May 2014.] Adoption of the R language has grown rapidly in the last few years, and is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves. In more than 6 years of writing for the Revolutions blog, I’ve discovered hundreds of applications of R in business, in government, and in the non-profit sector. Sometimes the use of R is obvious, and sometimes it takes a little bit of detective work to learn how R is operating behind the scenes. In this talk, I’ll begin by presenting some recent statistics on the growth of R. Then I’ll recount some of my favourite applications of R, and show how R is behind some amazing innovations in today’s world.

Snowplow presentation for Amsterdam Meetup #3

Snowplow Analytics

Our cofounder Alex Dean gave an introduction to Snowplow and then talked about our roadmap for 2017. Alex touched on several topics including support for more clouds, support for more storage targets, tailoring Snowplow to your industry, more intelligent event sources, moving our batch pipeline to Spark, mega-scale Snowplow and real-time support for Sauna, our decisioning and response system. Presented on 5 April 2017.

What's So Unique About a Columnar Database?

FlyData Inc.

Looking for the right database technology to use? Luckily there are many database technologies to choose from, including relational databases (MySQL, Postgres), NoSQL (MongoDB), columnar databases (Amazon Redshift, BigQuery), and others. Each choice has its own pros and cons, but today let’s walk through how columnar databases are unique by comparing them against the more traditional row-oriented database (e.g., MySQL).
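
The row-versus-column distinction is easiest to see with a tiny table stored both ways. The sketch below is an in-memory illustration only, with invented data and a hypothetical helper `to_columns`; real columnar databases add compression and on-disk layout on top of this idea.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Tokyo", "price": 120.0},
    {"id": 2, "city": "Osaka", "price": 80.0},
    {"id": 3, "city": "Tokyo", "price": 100.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: each field's values are stored contiguously.
columns = to_columns(rows)

# Aggregating one field touches every whole row in the row layout...
row_total = sum(row["price"] for row in rows)
# ...but only a single list in the columnar layout.
column_total = sum(columns["price"])
```

Both totals are the same; the difference is how much data has to be read to get there, which is why analytic queries over a few columns tend to favor columnar storage while single-record lookups and updates favor rows.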

The Evolution of Big Data Pipelines at Intuit

The document summarizes the evolution of Intuit's big data pipelines over time from disparate and chaotic early stages to their current integrated cloud-based architecture. It describes how Intuit transitioned from siloed data storage to a single cohesive data pipeline using Apache Kafka and real-time processing. It outlines the key components of their current big data pipeline including real-time data collection, processing, profile storage, and monitoring systems and how this pipeline supports use cases like personalization, fraud detection and more.

MicroStrategy at Badoo

Francesco Mucio

This document discusses Badoo's use of MicroStrategy for business intelligence and analytics. It describes how MicroStrategy helped Badoo overcome challenges with their previous BI tool by providing dimensional modeling, self-service reports, and weekly releases. It highlights how MicroStrategy enabled data discovery, analysis delivery, and reporting for over 90 users across various teams. The document also provides examples of query optimizations in MicroStrategy that improved performance. Finally, it discusses how MicroStrategy has enabled Badoo to empower users through visual insights, transaction services, command manager automation, and streamlined web deployments.
