Ideas for designing data science products for bots, knowledge gaps, and IoT + fairness, combining elements from Apache Spark and Turi's GraphLab products.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013), by Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in-memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on GitHub, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
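As a rough sketch of the kind of fast in-memory aggregation a tool like Sting performs (the record fields below are invented for illustration; this is not Sting's actual API), a group-and-sum in plain Python looks like:

```python
from collections import defaultdict

def aggregate(records, group_key, value_key):
    """Group records by one field and sum another, entirely in memory."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[group_key]] += rec[value_key]
    return dict(totals)

# Hypothetical viewing records; real inputs would come from the platform.
views = [
    {"title": "A", "country": "US", "hours": 2.0},
    {"title": "B", "country": "US", "hours": 1.5},
    {"title": "A", "country": "FR", "hours": 0.5},
]
hours_by_title = aggregate(views, "title", "hours")  # {"A": 2.5, "B": 1.5}
```

Holding the working set in memory is what makes this style of ad-hoc aggregation fast enough for interactive visualization.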
2016 Tableau in the Cloud - A Netflix Original (AWS re:Invent), by Albert Wong
Building a data platform doesn’t have to be like entering a portal to Stranger Things.
Join us for Tableau in the Cloud: A Netflix Original, where Albert Wong, Netflix’s analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale.
Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.
State of Play: Data Science on Hadoop in 2015, by Sean Owen at Big Data Spain 2014
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithm improvements; scale trumps noise and sample-size effects; manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Big Data Day LA 2016 / Big Data Track - Rapid Analytics @ Netflix LA (Updated ...), by Data Con LA
This talk explores how Netflix equips its engineers with the freedom to find and introduce the right software for the job - even if it isn't used anywhere else in-house. Examples include how Netflix has enabled analysts to fluidly switch between MPP RDBMS and an auto-scaling Presto cluster, how Spark + NoSQL stores are used when deploying data sets to internal web apps, and how data scientists are enabled to work in the ML framework of their choosing and deploy models as a service.
Zillow's favorite big data & machine learning tools, by njstevens
This talk covers Zillow's favorite tools for research tracking, cluster computing, open source machine learning, workflow management, logging, deep learning, and data storage.
Netflix Data Engineering @ Uber Engineering Meetup, by Blake Irvine
People, Platform, Projects: these slides overview how Netflix works with Big Data. I share how our teams are organized, the roles we typically have on the teams, an overview of our Big Data Platform, and two example projects.
Congressional PageRank: Graph Analytics of US Congress With Neo4j, by William Lyon
Interactions among members of any large organization are naturally a graph, yet the tools we use to analyze data about these organizations often ignore the graphiness of the domain and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time for analysis. Collaboration networks are a perfect example.
This talk will focus on analyzing one of the most powerful collaboration networks in the world, the US Congress. We will show how to model US Congressional data (legislators, bills, committees and the interactions among them) as a graph, how to import the data into the Neo4j graph database and how to write ad-hoc queries to answer simple questions such as “What are the topics of bills referred to committees on which California House Representatives serve?”.
We will then see how we can combine a graph processing engine (Apache Spark) with Neo4j to run graph algorithms like PageRank on our data stored in Neo4j. This will allow us to identify influential legislators in the network and the topics over which they exert influence. This talk will touch on topics related to graph data modeling, graph databases, graph processing, and social network analysis that can be applied to many different domains.
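The PageRank computation the talk runs with Spark can be sketched, independently of Spark or Neo4j, as a plain power iteration over an edge list (the toy graph below is invented for illustration, not real congressional data):

```python
def pagerank(edges, damping=0.85, iters=50):
    """Iterative PageRank over a directed edge list [(src, dst), ...]."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Every node keeps a teleport share, then receives rank from in-links.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

# Toy cosponsorship graph: "B" is endorsed by everyone else.
ranks = pagerank([("A", "B"), ("C", "B"), ("B", "D"), ("D", "B")])
```

The well-connected node ends up with the highest rank, which is exactly how influential legislators surface in the real analysis.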
Leveraging Spark to Democratize Data for Omni-Commerce, with Shafaq Abdullah (Databricks)
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how Honest Company has used Spark as a workhorse for: 1) collecting, transforming (ETL), and storing data from various sources, including MySQL, MongoDB, JDE, Google Analytics, Facebook, Localytics, and REST APIs; 2) building data models and aggregating and generating reports on revenue, order-fulfillment tracking, data-pipeline monitoring, and subscriptions; 3) using ML to build models for user acquisition, LTV, and recommendation use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations that improved their experience, data users at Honest came to understand users much better, segmenting them with behavioral information and advanced ML models, leading to increased revenue and retention.
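The collect-transform-report flow described above can be illustrated very schematically in plain Python (field names, sources, and figures are invented; the real pipelines are Spark jobs against live connectors):

```python
# Hypothetical mini-pipeline: each "source" yields order records, the
# transform derives revenue rows, and the report aggregates per day.
def extract(sources):
    for source in sources:
        yield from source()

def transform(orders):
    for o in orders:
        yield {"day": o["ts"][:10], "revenue": o["qty"] * o["unit_price"]}

def revenue_report(rows):
    report = {}
    for r in rows:
        report[r["day"]] = report.get(r["day"], 0.0) + r["revenue"]
    return report

def mysql_orders():        # stand-in for a MySQL extract
    yield {"ts": "2017-06-01T09:00", "qty": 2, "unit_price": 9.99}

def rest_api_orders():     # stand-in for a REST API extract
    yield {"ts": "2017-06-01T12:30", "qty": 1, "unit_price": 19.99}

report = revenue_report(transform(extract([mysql_orders, rest_api_orders])))
```

The point of moving such logic to Spark is that each stage scales out across partitions instead of running as one monolithic loop.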
Taking Jupyter Notebooks and Apache Spark to the Next Level: PixieDust, with Da... (Databricks)
PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
An Update on Scaling Data Science Applications with SparkR in 2018, with Heiko... (Databricks)
Spark has established itself as the most popular platform for advanced scale-out analytical applications. It is deeply integrated with the Hadoop ecosystem, offers a set of powerful libraries, and supports both Python and R. For these reasons, data scientists have started to adopt Spark to train and deploy their models. When Spark 1.4 was released back in 2015, it included the new SparkR library: this API gave R users the exciting new option to run R code on Spark.
And while the initial promise to provide a full R environment in Spark has been kept, it takes a deeper understanding of SparkR’s inner workings to make optimal use of its capabilities. This talk will give a comprehensive update on where we stand with data science applications in R based on the latest Spark releases. We will share insights from both a startup solution and a Fortune 100 company where SparkR does machine learning in the cloud on a scale that would not have been feasible previously: its parallel execution model runs in minutes and hours where conventional sequential approaches would take days and months.
Suggested Topics:
• An update on the SparkR architecture in the latest Spark release: using R with SparkSQL, MLlib and Spark’s Structured Streaming
• How to handle practical challenges, e.g. running R on the cluster without a local installation, storing non-tabular results, such as Data Science models or plots, mixing Scala and R.
• Scaling Big Compute Applications with SparkR: Parallelizing SparkR applications with User-Defined Functions (UDFs) and elastic scaling of resources in the Cloud
• An Outlook on Machine Learning with SparkR and its ecosystem, frameworks and tools.
• Plus: “Do I need to learn Python?”
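The UDF-based parallelization pattern from the bullets above can be mimicked in plain Python (a simplified stand-in for SparkR's per-partition UDF execution; the "model" here is just a partition mean, and the even-split partitioning is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor

def train_on_partition(rows):
    # Stand-in for a per-partition R UDF: fit a trivial "model" (the mean).
    return sum(rows) / len(rows)

def dapply_like(data, udf, n_partitions=4):
    """Split data into partitions and apply the UDF to each in parallel,
    loosely mirroring how SparkR runs a UDF over DataFrame partitions."""
    size = max(1, -(-len(data) // n_partitions))  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return list(pool.map(udf, partitions))

partition_models = dapply_like(list(range(100)), train_on_partition)
```

In real SparkR the partitions live on cluster executors, which is what turns a days-long sequential loop into a minutes-long parallel job.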
Slides from the Big Data Gurus meetup at Samsung R&D, August 14, 2013
This presentation covers the high level architecture of the Netflix Data Platform with a deep dive into the architecture, implementation, use cases, and future of Lipstick (https://github.com/Netflix/Lipstick) - our open source tool for graphically analyzing and monitoring the execution of Apache Pig scripts.
Netflix uses Apache Pig to express many complex data manipulation and analytics workflows. While Pig provides a great level of abstraction between MapReduce and data flow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. To address this problem, we created (and open sourced) a tool named Lipstick that visualizes and monitors the progress and performance of Pig scripts.
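A Pig script ultimately compiles down to chains of MapReduce jobs like the toy word count below (a stand-in for illustration, not Lipstick itself); Lipstick's value is making visible how data moves through each such stage:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs from each input line.
    return [(w, 1) for line in lines for w in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {k: sum(vs) for k, vs in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["to be", "or not to be"])))
```

Once a script chains dozens of such jobs with joins and groupings, the intermediate shapes become hard to reason about without tooling, which is the problem Lipstick addresses.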
Slides from Michelle Ufford's Data Warehousing talk at Hadoop Summit 2015.
How can we take advantage of the veritable treasure trove of data stored in Hadoop to augment our traditional data warehouses? In this session, Michelle will share her experience with migrating GoDaddy’s data warehouse to Hadoop. She’ll explore how GoDaddy has adapted traditional data warehousing methodologies to work with Hadoop and will share example ETL patterns used by her team. Topics will also include how the integration of structured and unstructured data has exposed new insights, the resulting business impact, and tips for making your own Hadoop migration project more successful.
Recording available here: https://www.youtube.com/watch?v=0AxoB-wJcZc
JanusGraph: Looking Backward and Reaching Forward, by Jason Plurad (@pluradj), shared by Demai Ni:
The JanusGraph project started at the Linux Foundation earlier this year, but it is not the new kid on the block. We'll start with a look at the origins and evolution of this open source graph database through the lens of a few IBM graph use cases. We'll discuss the new features in the latest release of JanusGraph, and then take a look at future directions to explore together with the open community.
Data Science at Scale: Using Apache Spark for Data Science at Bitly, by Sarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Data Warehousing with Spark Streaming at Zalando (Databricks)
Zalando's AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate already-outdated data. Modern data integration pipelines need to deliver fast, easy-to-consume data sets in high quality. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely-used master data as S3 or Kafka streams and snapshots at the same time.
The talk will cover challenges in our fashion data platform and a detailed architectural deep dive about separation of integration from enrichment, providing streams as well as snapshots and feeding the data to distributed data marts. Finally, lessons learned and best practices about Delta’s MERGE command, Scala API vs Spark SQL and schema evolution give more insights and guidance for similar use cases.
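Delta's MERGE, as used here, is essentially an upsert of incoming change rows into a snapshot; a minimal sketch of the semantics in plain Python (the article rows are invented, and this is of course not the Delta API):

```python
def merge(snapshot, updates, key="id"):
    """MERGE-style upsert: matched rows are updated in place,
    unmatched rows are inserted. snapshot is keyed by `key`."""
    for row in updates:
        # Matched: overlay the changed fields. Not matched: insert as new.
        snapshot[row[key]] = {**snapshot.get(row[key], {}), **row}
    return snapshot

# Current snapshot of article master data (hypothetical).
articles = {1: {"id": 1, "color": "red", "stock": 3}}

# One micro-batch of changes arriving from the stream.
stream_batch = [
    {"id": 1, "stock": 2},                    # matched -> update stock only
    {"id": 2, "color": "blue", "stock": 5},   # not matched -> insert
]
articles = merge(articles, stream_batch)
```

Applying each micro-batch this way is what lets the same pipeline serve both an up-to-date snapshot and a change stream.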
Slides from the August 2021 St. Louis Big Data IDEA meeting from Sam Portillo. The presentation covers AWS EMR including comparisons to other similar projects and lessons learned. A recording is available in the comments for the meeting.
Design for X: Design for Manufacturing, Design for Assembly, by Naseel Ibnu Azeez
Concurrent engineering is a contemporary approach to DFSS. DFX techniques are part of detail design and are ideal approaches for improving life-cycle cost and quality, increasing design flexibility, and increasing efficiency and productivity using concurrent design concepts (Maskell 1991). Benefits are usually framed as competitiveness measures, improved decision-making, and enhanced operational efficiency. The letter “X” in DFX is made up of two parts: life-cycle processes (x) and performance measures.
There are over 100,000 engineering materials to choose from. The typical design engineer should have ready access to information on 30 to 60 materials, depending on the range of applications he or she deals with.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017, by Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system built with tools like Web Services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We’ll cover the simplest solution, built with your day-to-day open source tools - and the surprising thing is that it comes not from an ops guy.
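A first cut at the "simple way" can be as plain as comparing collected metrics against configured limits (the component names, metric names, and limits below are invented examples, not the talk's actual stack):

```python
def check_thresholds(metrics, limits):
    """Return alert messages for any metric exceeding its configured limit."""
    alerts = []
    for component, values in metrics.items():
        for name, value in values.items():
            limit = limits.get(name)
            if limit is not None and value > limit:
                alerts.append(f"{component}: {name}={value} exceeds {limit}")
    return alerts

# Metrics scraped from each distributed component (hypothetical values).
metrics = {
    "spark-driver": {"batch_delay_sec": 45, "failed_tasks": 0},
    "cassandra":    {"pending_compactions": 120},
}
limits = {"batch_delay_sec": 30, "pending_compactions": 100, "failed_tasks": 1}
alerts = check_thresholds(metrics, limits)
```

Monitoring the data itself (row counts, freshness, schema drift) fits the same pattern: treat each data property as just another metric with a limit.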
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017, by Demi Ben-Ari
Monitoring Big Data Systems - "The Simple Way", by Demi Ben-Ari
Demi Ben-Ari is a Co-Founder and CTO @ Panorays.
Demi has over 9 years of experience building various systems, from near-real-time applications to Big Data distributed systems.
He describes himself as a software development groupie, interested in tackling cutting-edge technologies.
Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
Building search and discovery services for Schibsted (LSRS '17), by Sandra Garcia
Presentation given at the Large Scale Recommender Systems workshop (LSRS) in Recsys 2017.
This presentation describes the search and discovery products we are working on in Schibsted for the domains of news and marketplaces as well as the challenges within each of these domains. It also covers how we bring these services into production including the system architecture and deployment process.
Your Self-Driving Car - How Did it Get So Smart? (Hortonworks)
As we all can appreciate, “teaching” a vehicle to drive under the full range of conditions it will encounter (i.e. road conditions, weather conditions, behavior of other vehicles) is a daunting proposition. If merely the thought of this makes you nervous, you’re not alone – according to the American Automobile Association (AAA), 75 percent of consumers are not yet ready to embrace self-driving cars. However, that is the very challenge facing automakers – teaching vehicles to unfailingly assess and respond to any combination of operational conditions “on-the-fly” through discrete rules (algorithms) governing a vehicle’s behavior.
Join Hortonworks and NorCom at the upcoming webinar as we discuss:
•Evolution of autonomous driving
•Traditional data management approaches and main challenges associated with them
•How NorCom and Hortonworks can address those challenges and accelerate the pace of autonomous development
Off-Label Data Mesh: A Prescription for Healthier Data (Hosted by Confluent)
"Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh."
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
In this talk I will introduce you to a Docker container that provides you an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
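The PageRank analysis mentioned above can be sketched without a cluster. The following is a minimal, pure-Python version of the iterative algorithm that Spark GraphX's `pageRank` operator distributes (the edge list and parameter values here are illustrative, not from the talk):

```python
# Minimal PageRank over an edge list, mirroring the iterative algorithm
# that Spark GraphX distributes across a cluster.
def pagerank(edges, damping=0.85, iters=20):
    nodes = {n for e in edges for n in e}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    # Start with uniform rank mass across all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contrib = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            if targets:
                # Each node splits its rank evenly among its out-links.
                share = rank[src] / len(targets)
                for dst in targets:
                    contrib[dst] += share
        rank = {n: (1 - damping) / len(nodes) + damping * c
                for n, c in contrib.items()}
    return rank

# Tiny example graph: "c" receives links from both "a" and "b".
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

In GraphX the same per-iteration message passing happens via `aggregateMessages` over a distributed edge RDD; the arithmetic is identical.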
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Speakers
A changing market landscape and open source innovations are having a dramatic impact on the consumability and ease of use of data science tools. Join this session to learn about the impact these trends and changes will have on the future of data science. If you are a data scientist, or if your organization relies on cutting edge analytics, you won't want to miss this!
Over 90% of today’s data has been generated in the last two years, and growth rates continue to climb. In this session, we’ll step through challenges and best practices with data capturing, how to derive meaningful insights to help predict the future, and common pitfalls in data analysis.
Come discover how integrated solutions involving Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon Machine Learning/Deep Learning result in effective data systems for data scientists and business users, alike.
by Dan Romuald Mbanga, Business Development Manager, AWS
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding and recommendation engines. One of the key reasons for this progress is the availability of highly flexible and developer friendly deep learning frameworks. In this workshop, we will provide an overview of deep learning focusing on getting started with the TensorFlow and Keras frameworks on AWS. Level 100
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The hidden engineering behind machine learning products at Helixa
Gianmario Spacagna, (Helixa)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Tour de France Azure PaaS 6/7 Ajouter de l'intelligenceAlex Danvy
We will likely see a generational split between apps with artificial intelligence and those without. The latter, like character-mode applications when graphical interfaces arrived, will struggle to survive.
Azure offers three approaches for adding AI to an app, with graduated levels of difficulty, from a tool requiring no special skills to one dedicated to data scientists.
Microsoft Technologies for Data Science 201612Mark Tabladillo
Delivered to SQL Saturday BI Edition -- Atlanta, GA
Microsoft provides several technologies in and around Azure which can be used for casual to serious data science. This presentation provides an overview of the major Microsoft options for both on-premise and cloud-based data science (and hybrid). These technologies have been used by the presenter in various companies and industries, both as a Microsoft consultant and previously independent consultant. As well, the speaker provides insights into data science careers, information which helps imply where the business will likely be for consultants and partners.
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
Talking about the ease of use and handling Big Data technologies in the Cloud. Using Google Cloud Platform and Amazon Web Services and all of the tools around it.
Showing the problems and how we can solve them with simple tools.
Social media analytics using Azure TechnologiesKoray Kocabas
Social media are computer-mediated tools that allow people to create, share or exchange information, ideas, and pictures/videos in virtual communities and networks. To sum up Social Media is everything for your customers and Your company need to listen them to understand, make a custom offer or improve loyalty etc. Azure Stream Analytics and HDInsight platforms can solve this problem for you. We'll focus on how to get Twitter data using Stream Analytics and how to make data enrichment and storing using HDInsight and What is the problem about sentiment analytics using Azure Machine Learning.
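To make the sentiment analytics step concrete, here is a toy lexicon-based scorer. This is only a sketch of the concept; Azure Machine Learning's sentiment models are far richer, and the lexicon and tweets below are made up for illustration:

```python
# Toy lexicon-based sentiment scoring: sum per-word polarity and map
# the total to a label. Real services use trained models, not lookups.
LEXICON = {"love": 1, "great": 1, "happy": 1, "bad": -1, "hate": -1, "slow": -1}

def sentiment(tweet):
    score = sum(LEXICON.get(w, 0) for w in tweet.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

labels = [sentiment(t) for t in
          ["I love this great product",
           "the app is slow and bad",
           "just landed"]]
```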
Machine Learning on dirty data - Dataiku - Forum du GFII 2014Le_GFII
Talk by Florian Douetteau, CEO, Dataiku, at the 2014 GFII Forum.
Workshop: "From Business Intelligence to predictive analytics with Big Data", December 8, 2014.
Abstract: Predictive analytics is the new frontier of "data intelligence." The first industrial deployments are emerging, concretely illustrating the value of these approaches for managing complex systems more effectively (smart cities, transport, energy, maintenance, etc.), for supporting decision-making in risk management (natural, industrial, customer, economic, financial, etc.), and for refining offer personalization and recommendation in marketing and advertising.
Whatever the application, the goal is not to predict the future but to reduce uncertainty by modeling probabilities and evolution scenarios. The technologies have entered an operational phase. Advances in Big Data modeling, machine learning, and semantic algorithms now provide the computational power that was previously lacking to mine the vast sets of unstructured data available on the web, social media, and the Internet of Things.
Beyond the R&D challenges, the issue today is simplifying access to predictive approaches in order to democratize their use across business functions. Innovative solutions are being developed to ease model design and to simplify the development of "Web Services" or "Mobile BI" applications that better reach decision-makers. Cloud distribution models allow resources to be pooled, and solution providers are experimenting with innovative business models to reduce the cost of access to these technologies and spread them through enterprises.
The GFII Forum will devote a workshop to this theme, where solution providers will present use cases in Business Intelligence, predictive maintenance, and natural risk management.
Source : http://forum.gfii.fr/forum/de-la-business-intelligence-au-predictif-grace-aux-big-data
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
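Idempotency, one of the considerations named above, can be sketched in a few lines: derive each run's output location deterministically from the input partition, and overwrite rather than append, so a re-run produces identical results instead of duplicates. This is a generic illustration of the principle, not Waimak's API (the `transform` callback and `date=` partition layout are assumptions):

```python
# Sketch of an idempotent pipeline step: output path is a pure function
# of the input partition, and writes overwrite in place, so re-running
# the step with the same inputs cannot duplicate data.
import json
import pathlib
import tempfile

def run_step(records, partition, out_dir, transform):
    out_path = pathlib.Path(out_dir) / f"date={partition}" / "part-0.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    result = [transform(r) for r in records]
    # Overwrite: a second run with the same inputs yields byte-identical
    # output rather than appended duplicates.
    out_path.write_text("\n".join(json.dumps(r) for r in result))
    return out_path

out_dir = tempfile.mkdtemp()
double = lambda r: {"amt": r["amt"] * 2}
p1 = run_step([{"amt": 2}], "2024-01-01", out_dir, double)
p2 = run_step([{"amt": 2}], "2024-01-01", out_dir, double)  # safe re-run
```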
Similar to Design for X: Exploring Product Design with Apache Spark and GraphLab (20)
When Privacy Scales - Intelligent Product Design under GDPRAmanda Casari
Data-driven companies making intelligent products must design for security and privacy to be competitive globally. The EU General Data Protection Regulation (GDPR), implemented May 2018, is the benchmark that global data privacy will be measured against.
This presentation outlines the basic tenets of personal data and details the high-level changes that GDPR-compliant businesses face. It translates the current and near-future impact to teams designing products driven by machine learning and artificial intelligence and shares use cases of how SAP Concur is designing to meet this challenge while still delivering services to its end users that are driven by advanced algorithms.
Presented at The AI Conference, San Francisco, September 2018
Feature Engineering for Machine Learning at QConSPAmanda Casari
Machine learning fits mathematical models to data to derive insights or make predictions. Engineering the features that sit between data and models is a crucial step in the machine learning pipeline, because the right features can ease the difficulty of modeling and enable results of higher quality. In this talk, we will dive deeper into the mechanisms behind popular feature engineering techniques and walk through cases where these techniques are most useful. You will be able to better identify which methods to use based on your data and the problem you are working to solve.
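Two of the most common techniques in this space can be shown in a few lines: log-scaling a heavy-tailed count and binning a continuous value. The example data and bin edges below are illustrative, not from the talk:

```python
import math

def log_transform(count):
    # log1p compresses heavy-tailed counts into a range models handle well,
    # and is defined at zero (log1p(0) == 0).
    return math.log1p(count)

def bin_feature(value, edges):
    # Map a continuous value to the index of its half-open bin.
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

views = [1, 10, 100, 10000]
log_views = [log_transform(v) for v in views]      # tamed range
bins = [bin_feature(v, [10, 100, 1000]) for v in views]
```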
Understanding Products Driven by Machine Learning and AI: A Data Scientist's ...Amanda Casari
When describing a product as "data-driven" or "fueled by machine learning," it can be difficult to parse a common, fundamental definition of what makes an application "intelligent." In this talk, we will cover how you can peel back the buzzwords from the space of data science, machine learning, and artificial intelligence. You will be able to:
- Start sketching your own frameworks for understanding and evaluating these products
- Better understand how things can go wrong
- Know what questions to ask product vendors, and have a better understanding of their answers
- Learn more about data science as a process, as people organizations, as a product and as a service
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but my work reached 63K downloads (powering possibly tens of thousands of websites).
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Top Features to Include in Your Winzo Clone App for Business Growthrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Enterprise Resource Planning (ERP) systems include various modules that reduce any business's workload. Additionally, they organize workflows, which drives enhanced productivity. Here is a detailed explanation of the ERP modules; going through the points will help you understand how the software is changing work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Globus Compute with IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Design for X: Exploring Product Design with Apache Spark and GraphLab
1. DESIGN FOR X
exploring data science product design with apache spark + graphlab {create}
@amcasari @Concur
data science summit 2016, san francisco
nasa
2. data science via random walks
➤ senior product mgr + data scientist @ Concur Labs
➤ control systems engineering + robotics + legos
➤ officer in USN
➤ operations research analyst
➤ wandering dirtbag + conservation volunteer
➤ EE + applied math + complex systems
➤ underwater robotics engineer
➤ technology consultant
➤ SAHM
3. INSANELY QUICK INTRO TO +
➤ Concur Accelerator Team
➤ Concur Labs
➤ Incubator (still brewing)
Typical Day at Concur:
➤ 850K users log into Concur
➤ 300K expense reports processed
➤ 120K trips booked
➤ 170M trips & expense reports warehoused
How do we encourage a culture of innovation while delivering quality service to our existing 33,000 business clients and 40M users?
4. DESIGN SPRINTS FOR DATA SCIENCEY PROTOTYPES
courtesy google ventures {we iterated…because data}
5. INSANELY QUICK INTRO TO
➤ “fast and general engine for large-scale data processing”
➤ advanced cyclic data flow and in-memory computing > runs 10x-100x faster than Hadoop MR
➤ interactive shells in several languages (incl. SQL)
➤ performant + scalable
courtesy databricks
6. ALMOST AS INSANELY QUICK INTRO TO +
➤ graphlab create is based on a python data science library developed + (some) os’d by turi
➤ SFrame <<>> Spark DataFrame | SparkRDD
➤ (yes it works with Open Source SFrame and GLC)
courtesy turi
7. WHAT PROBLEM DO WE WANT TO DATA SCIENCE?
➤ Knowledge Gaps
➤ IOT Networks
➤ Bots
➤ Fairness
8. ➤ “We could {build this} {answer this better} if….”
➤ Reciprocal Data Applications
DESIGN FOR KNOWLEDGE GAPS
{diagram: three “choose your data storage” silos linked by reciprocal data applications (rda) into the app you really want to make}
9. ➤ “Can we trust our sensors?”
➤ “Has our network been hacked?”
DESIGN FOR IOT NETWORKS
{diagram: devices feeding data services that power alerts, notifications, and monitoring dashboards}
Anomaly Detection Toolkit
TimeSeries <<>> SFrame
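The anomaly detection the slide points at can be illustrated with the simplest possible detector: flag readings whose z-score against the series baseline exceeds a threshold. This is a sketch of the idea only, not GraphLab Create's Anomaly Detection Toolkit (which works on TimeSeries/SFrame objects and uses richer models); the sensor readings and threshold below are fabricated:

```python
import statistics

def detect_anomalies(series, threshold=2.0):
    # Flag points whose distance from the mean exceeds `threshold`
    # standard deviations. A production detector would use a rolling
    # window and a robust baseline rather than the whole series.
    mean = statistics.fmean(series)
    stdev = statistics.pstdev(series)
    return [i for i, x in enumerate(series)
            if stdev > 0 and abs(x - mean) / stdev > threshold]

# A stable ~10.0 sensor signal with one suspicious spike at index 5.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8]
anomalies = detect_anomalies(readings)
```

Note the modest threshold: with a short series, one large outlier inflates the standard deviation itself, which caps the maximum achievable z-score.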
10. ➤ “How do we create a conversational interface?”
….nothing new, just the burning question since Turing, 1950
DESIGN FOR BOTS
what NOT to do….
{diagram: a non-creepy, unisex animal mascot conversational ui; choose or create your framework; choose your data storage}
Advanced Deep Learning
Text Analysis Toolkit
Graph Analytics Toolkit
11. ➤ know your biases + limitations
➤ in your data, their data, all the data
➤ in your feature selection
➤ in your algorithm
…..because ethics (these ALL bias your results + communications)
DESIGN FOR FAIRNESS
learn more at data & society’s case studies
open source. reproducible. transparent.
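One concrete way to check for the biases the slide warns about is a simple group-outcome comparison, e.g. demographic parity difference: the gap in positive-outcome rates between groups. This is a generic sketch of one such metric, not something from the deck, and the approval data below is fabricated:

```python
def positive_rate(outcomes):
    # Fraction of positive (1) outcomes in a group.
    return sum(outcomes) / len(outcomes)

def demographic_parity_gap(outcomes_by_group):
    # Difference between the best- and worst-treated groups' rates;
    # 0.0 means all groups receive positive outcomes at the same rate.
    rates = [positive_rate(o) for o in outcomes_by_group.values()]
    return max(rates) - min(rates)

# 1 = approved, 0 = denied, per hypothetical group
gap = demographic_parity_gap({
    "group_a": [1, 1, 1, 0],   # 75% approved
    "group_b": [1, 0, 0, 0],   # 25% approved
})
```

A large gap doesn't prove unfairness on its own, but it is the kind of open, reproducible, transparent check the slide advocates running on data, features, and model outputs alike.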
12. {THANKS MUCH}
➤ Concur is hiring!
➤ SAP + SAP Ariba are
hiring!
concurlabs.com
github.com/
concurlabs
➤ example notebooks will
be posted on our
github in the future
@amcasari