A Beginner's Guide to Building Data Pipelines with Luigi

In this talk, we provide an introduction to Python Luigi via real-life case studies, showing how you can break a large, multi-step data processing task into a graph of smaller sub-tasks that are aware of the state of their interdependencies.
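To make the idea of a graph of state-aware sub-tasks concrete, here is a minimal standard-library sketch. It is not Luigi code: the `Step` class, the `build` function and the step names are hypothetical and only illustrate that each step runs after its dependencies and that finished steps are skipped on a later run.

```python
class Step:
    """A hypothetical sub-task that knows its dependencies and its own state."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.done = False

    def run(self):
        # A real step would do work here; the sketch only records completion.
        self.done = True


def build(step, log):
    """Run `step` after its dependencies, skipping anything already done."""
    if step.done:
        return
    for dep in step.requires:
        build(dep, log)
    step.run()
    log.append(step.name)


# Hypothetical pipeline: report needs clean and features, features needs clean.
download = Step("download")
clean = Step("clean", requires=[download])
features = Step("features", requires=[clean])
report = Step("report", requires=[clean, features])

first_run = []
build(report, first_run)   # runs every step once, dependencies first

second_run = []
build(report, second_run)  # everything is already done, so nothing runs
```

The sketch assumes the graph has no cycles and keeps state in memory; a real pipeline would persist state, for example by checking whether each step's output already exists.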

Growth Intelligence tracks the performance and activity of all the companies in the UK economy using their data ‘footprint’. This involves tracking numerous unstructured data points from multiple sources in a variety of formats and transforming them into a standardised feature set we can use for building predictive models for our clients.

In the past, this data was collected in a somewhat haphazard fashion: a combination of manual effort, ad hoc scripting and processing that was difficult to maintain. To streamline the data flows, we’re using an open-source Python framework from Spotify called Luigi. Luigi was created for managing task dependencies, monitoring the progress of the data pipeline and providing frameworks for common batch processing tasks.

Published in: Technology

  1. A Beginner’s Guide to Building Data Pipelines with Luigi
  2. Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI? (Diagram labels: UK Limited Companies; Customer CRM Data; Predictive Model.)
  3. With big data comes big responsibility
  4. Script Soup: hard to maintain, extend, and… look at. (Diagram labels: Code; More Codes; omg moar codez.)
  5. The Old Way: Define a command line interface for every task?
          if __name__ == '__main__':
              today = datetime.now().isoformat()[:10]  # <- Custom date handling
              arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER',
                                                   description='Process arguments')
              arg_parser.add_argument('--files', nargs='?', dest='file_names',
                                      help='CSV files to read (supports globstar wildcards)',
                                      required=True)
              arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int,
                                      help='Number of rows to save to DB at once')
              arg_parser.add_argument('--date', nargs='?', dest='table_date',
                                      help='Date that these data were released')
              arg_parser.add_argument('--log_level', dest='log_level',
                                      help='Log level to screen', default='INFO')
              args = arg_parser.parse_args()
  6. The Old Way: Long, processor intensive tasks stacked together
          log = GLoggingFactory().getLoggerFromPath(
              '/var/log/companies-house-load-csv.log-{}'.format(today))
          log.setLogLevel(args.log_level, 'screen')  # <- Custom logging
          table_date = parse_date(args.table_date, datetime.now())
          log.info('Starting Companies House data loader...')
          ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping,
                                           table_date=table_date)
          ch_loader.go(args.file_names)  # <- What to do if this fails?
          log.info('Loader complete. Starting Companies House updater')
          ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                             company_status_params=company_status_params)
          # <- Need to clean up if this fails
          ch_updater.go()
  7. Luigi to the rescue!
      ● Open-sourced & maintained by the Spotify data team
      ● Erik Bernhardsson and Elias Freider. Maintained by Arash Rouhani.
      ● Abstracts batch processing jobs
      ● Makes it easy to write modular code and create dependencies between tasks.
  8. Luigi
      ● Task templating
      ● Dependency graphs
      ● Resumption of data flows after intermediate failure
      ● Command line integration
      ● Error emails
  9. Luigi 101: Counting the number of companies in the UK. (Diagram: companies.csv → input() → Count companies → output() → count.txt.)
  10. Company Count Job in Luigi code
          import luigi

          class CompanyCount(luigi.Task):
              def output(self):
                  return luigi.LocalTarget("count.csv")

              def run(self):
                  count = count_unique_entries("companies.csv")
                  with self.output().open("w") as out_file:
                      out_file.write(str(count))  # write() expects a string
  11. Luigi 101: Keeping our count up to date. (Diagram: Companies Data Server → Companies Download → output() → companies.csv → input() → Count companies → output() → count.txt; Count companies requires() Companies Download.)
  12. Company count with download dependency
          class CompanyCount(luigi.Task):
              def requires(self):
                  # this task must complete before CompanyCount runs
                  return CompanyDownload()

              def output(self):
                  return luigi.LocalTarget("count.csv")

              def run(self):
                  # self.input() is the output of the required task
                  count = count_unique_entries(self.input())
                  with self.output().open("w") as out_file:
                      out_file.write(str(count))  # write() expects a string
  13. Download task
          class CompanyDownload(luigi.Task):
              def output(self):
                  # local output to be picked up by previous task
                  return luigi.LocalTarget("companies.csv")

              def run(self):
                  # download the data and write it to the output Target
                  data = get_company_download()
                  with self.output().open('w') as out_file:
                      out_file.write(data)
  14. Running the flow
          $ python company_flow.py CompanyCount --local-scheduler
          DEBUG: Checking if CompanyCount() is complete
          DEBUG: Checking if CompanyDownload() is complete
          INFO: Scheduled CompanyCount() (PENDING)
          INFO: Scheduled CompanyDownload() (PENDING)
          INFO: Done scheduling tasks
          INFO: Running Worker with 1 processes
          DEBUG: Asking scheduler for work...
          DEBUG: Pending tasks: 2
          INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
          INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
          DEBUG: 1 running tasks, waiting for next task to finish
          DEBUG: Asking scheduler for work...
          DEBUG: Pending tasks: 1
          INFO: [pid 10076] Worker Worker(...) running CompanyCount()
          INFO: [pid 10076] Worker Worker(...) done CompanyCount()
          DEBUG: 1 running tasks, waiting for next task to finish
          DEBUG: Asking scheduler for work...
          INFO: Done
  15. Time dependent tasks: change in companies. (Diagram: Companies Count Task(Date 1), Companies Count Task(Date 2) and Companies Count Task(Date 3) each feed input() into Companies Delta, whose output() is company_count_delta.txt.)
  16. Parameterising Luigi tasks
          class AnnualCompanyCountDelta(luigi.Task):
              year = luigi.Parameter()  # define parameter

              def requires(self):
                  # generate dependencies: one CompanyCount per month
                  tasks = []
                  for month in range(1, 13):
                      tasks.append(CompanyCount(dt.datetime.strptime(
                          "{}-{}-01".format(self.year, month), "%Y-%m-%d")))
                  return tasks

              # not shown: output(), run()
  17. Adding the date dependency to Company Count
          class CompanyCount(luigi.Task):
              date = luigi.DateParameter(default=datetime.date.today())

              def requires(self):
                  # added date dependency to company count
                  return CompanyDownload(self.date)

              def output(self):
                  # one output file per date, so each date has its own completion state
                  return luigi.LocalTarget("count-{}.csv".format(self.date))

              def run(self):
                  count = count_unique_entries(self.input())
                  with self.output().open("w") as out_file:
                      out_file.write(str(count))  # write() expects a string
  18. The central scheduler (by default, localhost:8082)
          $ luigid &  # start central scheduler in background
          $ python company_flow.py AnnualCompanyCountDelta --year 2014
  19. Persisting our data. (Diagram: Companies Data Server → Companies Download(Date) → output() → companies.csv → Count companies(Date) → output() → count.txt. Count companies(Date) and Companies ToMySQL(Date) each requires(Date) Companies Download(Date); Companies ToMySQL(Date) → output() → SQL Database.)
  20. Persisting our data
          class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):
              date = luigi.DateParameter()
              columns = [(["name", String(100)], {}), ...]
              connection_string = "mysql://localhost/test"  # or something
              table = "companies"  # name of the table to store data

              def requires(self):
                  return CompanyDownload(self.date)

              def rows(self):
                  for row in self.get_unique_rows():  # uses self.input()
                      yield row
  21. My pipes broke
          # ./client.cfg
          [core]
          error-email: dylan@growthintel.com, stuart@growthintel.com
  22. Things we missed out. There are lots of task types that we haven’t mentioned:
      ● Hadoop
      ● Spark
      ● ssh
      ● Elasticsearch
      ● Hive
      ● Pig
      ● etc.
      Check out the luigi.contrib package.
  23. Counting the companies using Hadoop
          class CompanyCount(luigi.contrib.hadoop.JobTask):
              chunks = luigi.Parameter()

              def requires(self):
                  # split input in chunks
                  return [CompanyDownload(chunk) for chunk in self.chunks]

              def output(self):
                  # HDFS target
                  return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

              # map and reduce methods instead of run()
              def mapper(self, line):
                  yield "count", 1

              def reducer(self, key, values):
                  yield key, sum(values)
  24. Luigi Limitations
      ● Doesn’t provide a way to trigger flows
      ● Doesn’t support distributed execution
  25. Onwards
      ● The docs: http://luigi.readthedocs.org/
      ● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
      ● The source: https://github.com/spotify/luigi
      ● The maintainers are really helpful, responsive, and open to any and all PRs!
  26. Thanks!
      Stuart Coleman: @stubacca81 / stuart@growthintel.com
      Dylan Barth: @dylan_barth / dylan@growthintel.com
      We’re hiring Python data scientists & engineers! http://www.growthintel.com/careers/
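
Slide 16 generates one dependency per month by formatting and re-parsing a date string. As a hedged alternative, the same twelve dates can be built directly with the standard library. The helper names `month_starts` and `count_path` are hypothetical, and the per-date file naming is an assumption rather than something shown in the talk.

```python
import datetime


def month_starts(year):
    """Return the first day of each month of `year` as datetime.date objects.

    `year` may be a string, since a command-line parameter typically
    arrives as text.
    """
    return [datetime.date(int(year), month, 1) for month in range(1, 13)]


def count_path(date):
    """Hypothetical naming scheme: one count file per date."""
    return "count-{}.csv".format(date.isoformat())


dates = month_starts("2014")
paths = [count_path(d) for d in dates]
```

Giving every date its own output path matters because a task whose output already exists is treated as complete; if all twelve monthly tasks wrote to the same file, the first one to finish would make the others look done.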

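Slide 8 lists resumption of data flows after intermediate failure. That only works if a half-written output is never mistaken for a finished one. The sketch below shows one common way to get that property with the standard library: write to a temporary file and rename it into place only on success. It is not Luigi's code, and `write_atomically`, `run_if_needed` and `failing_writer` are hypothetical names.

```python
import os
import tempfile


def write_atomically(path, write_fn):
    """Call write_fn(file) on a temp file, then rename it to `path` on success."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp_file:
            write_fn(tmp_file)
        os.replace(tmp_path, path)  # the output appears only when fully written
    except BaseException:
        os.remove(tmp_path)         # a failed run leaves no output behind
        raise


def run_if_needed(path, write_fn):
    """Run the step only if its output is missing; report whether it ran."""
    if os.path.exists(path):
        return False
    write_atomically(path, write_fn)
    return True


def failing_writer(out_file):
    out_file.write("partial")
    raise RuntimeError("simulated crash mid-write")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "count.txt")

try:
    run_if_needed(target, failing_writer)
except RuntimeError:
    pass
exists_after_failure = os.path.exists(target)   # no partial file was left

ran_on_retry = run_if_needed(target, lambda f: f.write("42"))
ran_again = run_if_needed(target, lambda f: f.write("43"))  # skipped: output exists

with open(target) as result_file:
    content = result_file.read()
```

After the simulated crash the retry starts from a clean state, and once the output exists later runs skip the step, which is the behaviour the "Checking if ... is complete" lines on slide 14 rely on.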