class: center, middle, inverse, title-slide .title[ # Databases with R (part 4) ] .subtitle[ ##
https://pos.it/databases-24
] .author[ ### Kirill Müller ] .author[ ### cynkra GmbH ] .date[ ### August 12, 2024 ] --- <style type="text/css"> .pull-left { margin-top: -25px; } .pull-right { margin-top: -25px; } .r.hljs.remark-code { font-size: 20px; } .remark-code { font-size: 20px; } .fansi.fansi-error { color: #D70040; } pre { white-space: pre-wrap; /* css-3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* Internet Explorer 5.5+ */ margin-bottom: 0px; } .reg-margin-bottom { margin-bottom: 25px; } .remark-slide-content { padding-top: 0px; padding-bottom: 0px; } .remark-slide-scaler { overflow-y: auto; } .font17 { font-size: 17px; } .font14 { font-size: 14px; } .script-number-font { color: inherit !important; opacity: 0.9 !important; transform: scale(1.8); transform-origin: top left; } .script-number { color: inherit !important; opacity: 0.9 !important; position: absolute; top: 12px; right: 20px; transform: scale(1.8); transform-origin: top right; } .exercise-timer { position: fixed; bottom: 12px; left: 20px; transform: scale(1.8); transform-origin: bottom left; } .exercise-timer img { max-width: none; /* Prevent the image from being constrained by the container */ } </style> # Backup Lab - Instructions 1 1. Analyze Venture Capital data! 1. Download data from <https://www.sec.gov/data-research/sec-markets-data/form-d-data-sets>. 2023 to 2024 should be enough to start out! 1. Unzip them programmatically in a dir. called `sec-unzipped` within your working directory 1. Familiarise yourself with the content and check-out the next slides providing more context on the data. Also a data dictionary can be found here: https://www.sec.gov/files/Form_D.pdf <div class="script-number"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="20" role="img" aria-label="Script: db_40.R"><title>Script: db_40.R</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="100" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="43" height="20" fill="#555"/><rect x="43" width="57" height="20" fill="#fe7d37"/><rect width="100" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="225" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="330">Script</text><text x="225" y="140" transform="scale(.1)" fill="#fff" textLength="330">Script</text><text aria-hidden="true" x="705" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="470">db_40.R</text><text x="705" y="140" transform="scale(.1)" fill="#fff" textLength="470">db_40.R</text></g></svg></div> --- # Backup Lab - Instructions 2 1. Choose a quarter and read each tsv file into a persistent DuckDB. Create the following tables: `form_d`, `issuers`, `offering`, `recipients` - Hint: `duckdb::tbl_file()` <div class="script-number"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="20" role="img" aria-label="Script: db_40.R"><title>Script: db_40.R</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="100" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="43" height="20" fill="#555"/><rect x="43" width="57" height="20" fill="#fe7d37"/><rect width="100" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="225" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="330">Script</text><text x="225" y="140" transform="scale(.1)" fill="#fff" textLength="330">Script</text><text aria-hidden="true" x="705" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="470">db_40.R</text><text x="705" y="140" transform="scale(.1)" fill="#fff" textLength="470">db_40.R</text></g></svg></div> --- # Backup Lab - Instructions 3 1. Explore and analyze the data. - Check for duplicates and identify combinations of columns that uniquely each observation (hint: also help yourself with the data dictionary) - Number of submissions per month and submission type - Amount raised per state - Rank the ten firms that raised the most capital - Hint: `dbplyr::window_order()` - Number of submissions per state <div class="script-number"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="20" role="img" aria-label="Script: db_40.R"><title>Script: db_40.R</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="100" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="43" height="20" fill="#555"/><rect x="43" width="57" height="20" fill="#fe7d37"/><rect width="100" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="225" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="330">Script</text><text x="225" y="140" transform="scale(.1)" fill="#fff" textLength="330">Script</text><text aria-hidden="true" x="705" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="470">db_40.R</text><text x="705" y="140" transform="scale(.1)" fill="#fff" textLength="470">db_40.R</text></g></svg></div> --- # Backup Lab - Instructions 4 1. Create a dm object setting primary and foreign keys 1. Draw the dm object and check for constraints violations (hint: `dm_examine_constraints()`) 1. Load all the files for each quarter available and create a dm for the full set of data 1. Repeat the analysis for the full set of data for "New York", "California" and "Texas" (when applicable) <div class="script-number"><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="20" role="img" aria-label="Script: db_40.R"><title>Script: db_40.R</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="100" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="43" height="20" fill="#555"/><rect x="43" width="57" height="20" fill="#fe7d37"/><rect width="100" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="225" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="330">Script</text><text x="225" y="140" transform="scale(.1)" fill="#fff" textLength="330">Script</text><text aria-hidden="true" x="705" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="470">db_40.R</text><text x="705" y="140" transform="scale(.1)" fill="#fff" textLength="470">db_40.R</text></g></svg></div> --- # Backup Lab - Glossary 1 ## Form D A standard form that issuers file with the SEC after they sell their first securities in any offering under Regulation D. Form D is a brief notice that includes basic information about the company and the offering, such as the names and addresses of the company’s executive officers, the size of the offering, and the date of first sale. Source: https://www.sec.gov/resources-small-businesses/cutting-through-jargon-z#EO --- # Backup Lab - Glossary 2 ## SEC The U.S. Securities and Exchange Commission (SEC) is an independent agency of the United States federal government, created in the aftermath of the Wall Street Crash of 1929. The primary purpose of the SEC is to enforce the law against market manipulation. Source: <https://en.wikipedia.org/wiki/U.S._Securities_and_Exchange_Commission> ## Regulation D It includes two SEC rules—Rules 504 and 506—that issuers often rely on to sell securities in unregistered offerings. Most private placements are conducted pursuant to Rule 506. --- # Glossary 1 ## Query A request for data or information from a database, usually written in SQL. ## Schema The structure that defines the organization of data in a database, including tables and views. ## Catalog Collection of metadata that describes the structure and organization of a database. --- # Glossary 2 ## Partition Dividing a database into distinct, smaller, more manageable pieces for improved performance and organization. ## Dump Exporting the contents of a database, typically to a file, which can be used for backup or transfer purposes. ## Views A virtual table based on the result-set of an SQL query. Views do not store data themselves but display data stored in tables. --- # Glossary 3 ## Transactions A sequence of one or more SQL operations executed as a single unit, ensuring data integrity. ## ACID A set of properties that guarantee database transactions are processed reliably (Atomicity, Consistency, Isolation, Durability). ## Normalization The process of organizing data to reduce redundancy and improve data integrity. --- # Glossary 4 ## Index A database structure that improves the speed of data retrieval operations on a database table. ## OLTP Online Transaction Processing are systems that manage transaction-oriented applications, typically for data entry and retrieval transaction processing. ## OLAP Online Analytical Processing are systems that allow users to analyze large volumes of data from multiple perspectives. --- # Useful resources 1 ## Databases - Article: [Best Practices in Working with Databases](https://solutions.posit.co/connections/db/) - R: [R for Data Science (2e)](https://r4ds.hadley.nz/databases) - Book: [Fundamentals of Database Management Systems, 3rd Edition](https://www.wiley.com/en-us/Fundamentals+of+Database+Management+Systems%2C+3rd+Edition-p-9781119907442) - Book: [Designing Data-Intensive Applications](https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/) - Link: [DB-Engines](https://db-engines.com/en/) - Lectures: [CMU Database Group](https://www.youtube.com/c/cmudatabasegroup) --- # Useful resources 2 # SQL - Book: [SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis](https://www.wiley.com/en-us/SQL+for+Data+Scientists%3A+A+Beginner's+Guide+for+Building+Datasets+for+Analysis-p-9781119669364) - Book: [Learning SQL, 3rd Edition](https://www.oreilly.com/library/view/learning-sql-3rd/9781492057604/) --- # Useful resources 3 ## DuckDB and duckplyr - Paper: [DuckDB: an Embeddable Analytical Database](https://dl.acm.org/doi/10.1145/3299869.3320212) - Book: [DuckDB in Action](https://www.manning.com/books/duckdb-in-action) - Video: [In-Process Analytical Data Management with DuckDB - posit::conf(2023)](https://www.youtube.com/watch?v=9OFzOvV-to4) - Video: [DuckDB – The SQLite for Analytics (Mark Raasveldt, CWI)](https://www.youtube.com/watch?v=PFUZlNQIndo) - Blog post: [duckplyr: dplyr Powered by DuckDB](https://duckdb.org/2024/04/02/duckplyr.html) - Video: [duckplyr: Tight Integration of duckdb with R and the tidyverse - posit::conf(2023)](https://www.youtube.com/watch?v=V9GwSPjKMKw) --- # Useful resources 4 ## Data modeling - R: [dm's Tutorials and Technical articles](https://dm.cynkra.com/) - Book: [Agile Data Warehouse Design eBook](https://lawrencecorr.gumroad.com/l/modelstorming) - Book: [Data Modeling Essentials ](https://graemesimsion.com/the-data-modelling-books/) ## Read-only databases - Database playground (read only): [relational.fit](https://relational.fel.cvut.cz/) - Database playground (read only):[RNA Cantral Public Postgres database](https://rnacentral.org/help/public-database) --- # Survey Thanks for the attention. Please, help us improve by sharing your experience with us at https://pos.it/conf-workshop-survey.