- projects completed
- Helped with a POC to determine which Hadoop environment we'd use (Cloudera vs.
Hortonworks)
- Did performance testing
- Applied 2 business cases and compared performance differences with BI tools
(Tableau)
- Google AdWords to SQL Server automation with Python
- wrote custom libraries that pulled Google AdWords data into SQL Server
- saved marketing the 2 hours of work every day it took to pull data manually for
reports
- SQL Server to Google BigQuery
- pulled various data sets with custom Python libraries from SQL Server to
Google BigQuery
- Google Calendar to SQL Server for PTO tracking tool
- created a calendar that employees put their PTO data into and pulled the data
down using Google APIs
- helped track PTO for employees/departments
- SFTP to SQL Server (Paylocity and Radius HR data)
- pulled data from SFTP servers with Python and loaded it into SQL Server,
then into BigQuery
- Relational DB to Hadoop library (FeastMode)
- wrote a Python library that pulled data from relational databases (SQL Server,
MySQL, etc.) into Hadoop, driven by configuration and by scanning the databases'
metadata
- saved time, as pulling a whole database was as simple as typing 1 line
of code, which loaded it into Hadoop as Parquet-formatted, Snappy-compressed
Impala tables for reporting
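A minimal sketch of the metadata-driven idea above, not the FeastMode code itself: given column metadata scanned from a relational database, it builds the DDL for a Parquet-backed Impala table. The function name, the type mapping, and the STRING fallback are all assumptions made for illustration, and compression settings are left out.

```python
# Hypothetical mapping from relational column types to Impala types.
# A real mapping would need to cover many more source types.
TYPE_MAP = {
    "int": "INT",
    "bigint": "BIGINT",
    "varchar": "STRING",
    "nvarchar": "STRING",
    "datetime": "TIMESTAMP",
    "bit": "BOOLEAN",
    "float": "DOUBLE",
}


def build_impala_ddl(table, columns):
    """Build a CREATE TABLE statement for a Parquet-backed Impala table.

    `columns` is a list of (column_name, source_type) pairs, such as the
    rows a query against INFORMATION_SCHEMA.COLUMNS might return.
    Unknown source types fall back to STRING (an assumption of this sketch).
    """
    column_defs = ",\n".join(
        f"  `{name}` {TYPE_MAP.get(source_type.lower(), 'STRING')}"
        for name, source_type in columns
    )
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n"
        f"{column_defs}\n"
        f") STORED AS PARQUET"
    )


if __name__ == "__main__":
    # Illustrative metadata for a made-up table.
    metadata = [("id", "int"), ("email", "nvarchar"), ("created_at", "datetime")]
    print(build_impala_ddl("customers", metadata))
```

Generating the DDL from metadata rather than writing it by hand is what would let a whole database be loaded from a single call that loops over its tables.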
- Salesforce Python Library
- wrote a custom Python library that used the Salesforce Bulk REST API and
configurations to pull data from Salesforce into Hadoop and then eventually into
the legacy SQL Server system
- also had an incremental loading feature, which sped up the data load and
made intraday (hourly) reporting possible
- used lambda architecture to make incremental loads possible in Hive
- saved the company from paying an extra $1200 a year for each license of DB
Amp
- there was no 3rd party tool to pull data from Salesforce into Hadoop
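One common way to make incremental loads work is to keep a base snapshot plus a set of newer rows and reconcile them to the latest row per key. The notes above do not say how the library did this, so the following is only a sketch of that idea in plain Python. The `reconcile` function is hypothetical, and the key and timestamp field names are assumptions (they happen to match standard Salesforce field names).

```python
def reconcile(base_rows, incremental_rows, key="Id", modified="LastModifiedDate"):
    """Merge a base snapshot with newer incremental rows.

    For each key, the row with the greatest `modified` value wins. This is
    the same idea as a view that ranks the union of a base table and an
    incremental table by modification time and keeps the top row per key.
    Timestamps are assumed to be ISO-8601 strings in one consistent format,
    so plain string comparison orders them correctly.
    """
    latest = {}
    for row in list(base_rows) + list(incremental_rows):
        current = latest.get(row[key])
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


if __name__ == "__main__":
    base = [
        {"Id": "a", "LastModifiedDate": "2017-01-01T00:00:00Z", "Amount": 1},
        {"Id": "b", "LastModifiedDate": "2017-01-01T00:00:00Z", "Amount": 2},
    ]
    hourly = [
        {"Id": "b", "LastModifiedDate": "2017-01-02T08:00:00Z", "Amount": 5},
        {"Id": "c", "LastModifiedDate": "2017-01-02T08:30:00Z", "Amount": 3},
    ]
    for row in reconcile(base, hourly):
        print(row)
```

Because only rows changed since the last pull need to be fetched, each run moves far less data than a full reload, which is what makes hourly refreshes practical.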
- Zuora Python Library
- wrote a custom Python library that used the Zuora REST API and
configurations to pull data from Zuora into Hadoop
- also had an incremental loading feature, which sped up the data load and
made intraday (hourly) reporting possible
- used lambda architecture to make incremental loads possible in Hive
- there was no 3rd party tool to pull data from Salesforce into Hadoop
- project was done because the legacy system wasn't working correctly
- Consuming RabbitMQ messages
- built Python libraries that read messages and stored the data in a reportable
format, or used messaging to trigger jobs
- example: live offer code redemptions for real-time reporting
- Built Company KPIs
- worked closely with finance to build company KPIs
- used mostly local datasets and NetSuite data
- eventually included Zuora and Salesforce data
- used complex SQL in Impala
- Built Company Billings dataset
- worked closely with marketing/finance to build a billings data set that would
be the source of truth for financial reporting for the company
- used Salesforce and Zuora data that was pulled with Python libraries, as well
as NetSuite data
- used complex SQL in Impala
- Built Subscriber Snapshot dataset
- with direction from the data science team, helped build a subscriber
snapshot data set that had daily and monthly granularity
- had users' subscription info, usage, profile, and demographics, which were
then used to segment the subscribers
- used complex SQL in Impala
- Set up the Airflow workflow tool and used it for job automation
- set up Airflow on a Linux machine and wrote custom DAGs using Python
scripts for job scheduling
- Used cron for job scheduling
- used cron on Linux to schedule jobs (eventually moved jobs to Airflow)
- S3 to Hadoop
- pulled various data from Amazon S3 and loaded it into Hadoop
- data could come from the dev product team or a different part of the company
(Code School, Digital Tutors, etc.)
- Searchlight, Wootric, Desk and other REST APIs with Python
- created custom Python libraries that called REST APIs from 3rd party
vendors and loaded the data into Hadoop for reporting purposes
- usually JSON responses
- used external table definitions on top of JSON files to create tables
- used views with complex logic (lateral view explodes, etc.) to extract the
necessary data in a reportable format in Impala
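To illustrate what an explode-style view does to a nested JSON response, here is a small Python analogue. It is a sketch only: the `explode` helper and the sample payload are made up for illustration and do not come from any of the vendor APIs named above.

```python
import json


def explode(record, array_field):
    """Yield one flat row per element of `record[array_field]`.

    Parent fields are copied onto every row, and the fields of each
    element (assumed to be a dict) are prefixed with the array field's
    name, roughly what a lateral-view explode produces in SQL.
    """
    parent = {k: v for k, v in record.items() if k != array_field}
    for element in record.get(array_field, []):
        row = dict(parent)
        for k, v in element.items():
            row[f"{array_field}_{k}"] = v
        yield row


if __name__ == "__main__":
    # Made-up response body with one nested array.
    body = '{"survey_id": 7, "responses": [{"score": 9}, {"score": 4}]}'
    for row in explode(json.loads(body), "responses"):
        print(row)
```

Each nested element becomes its own row with the parent fields repeated, which is the shape reporting tools generally expect.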
- Kafka to Hadoop and Hadoop to Kafka Python libraries
- changed configs to let data from Kafka dispatchers go to HDFS
- then loaded those records into Hadoop
- created dynamic DSL code that sent data from HDFS folders into Kafka topics
- Created company-wide hourly office fitness Slack channel automated with Python
- for a hackday project, created a Python library that used the Slack REST APIs
to notify employees to get out of their seats and do an hourly workout
- it let them self check in on whether they did the workout, and there was a
Tableau dashboard that tracked the company's progress
- Created Python library that used Slack for ETL alerts
- created a Python library that sent a notification via a Slack webhook
when jobs failed
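A job-failure alert of this kind can be sketched with the standard library alone. Slack incoming webhooks accept a JSON body with a `text` field; everything else here (the function names, the message wording, and the placeholder webhook URL) is an assumption for illustration, not the original library.

```python
import json
import urllib.request


def build_alert_request(webhook_url, job_name, error):
    """Build (but do not send) the HTTP request for a job-failure alert."""
    payload = {"text": f"ETL job `{job_name}` failed: {error}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url, job_name, error, timeout=10):
    """POST the alert to the webhook and return the HTTP status code."""
    request = build_alert_request(webhook_url, job_name, error)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder URL: a real one comes from the Slack app's webhook settings.
    url = "https://hooks.slack.com/services/T000/B000/XXXX"
    request = build_alert_request(url, "load_billings", "source timeout")
    print(request.get_method(), request.data.decode("utf-8"))
```

Keeping request construction separate from sending means the message format can be checked without touching the network; a job wrapper would call `send_alert` from its exception handler.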
