Sugarcoating

June 12, 2017 • 2 min read

Our understanding of the world is often not veridical. How the world is framed often becomes reality—which is why I’ve been thinking about two words recently.

The first is “feature phone,” which I put in quotes because feature phones are phones without features. They don’t do what an iPhone does. If this word was coined by “feature phone” manufacturers trying to make their phones seem better than they actually are, in this respect I don’t think they were successful. People use phones too much. I doubt anybody has walked into a store looking for an iPhone and come out with a dumb phone that was billed as a “feature phone.”

The other word that I’ve been thinking about is “defined contribution.” Like “feature phone,” the name sugarcoats the most important part of the thing it describes: for the feature phone, the functionality of your phone; for defined contribution, the money you receive in retirement. In a defined contribution plan, like a 401(k) plan, you put away money to buy stocks, bonds, and other investments, and can then access them in retirement, but you can put away as little money as you want. Your contribution isn’t strictly defined, and, most importantly, your benefits are not defined. In a defined benefit plan (i.e., a pension), your benefits are set by multiple factors (e.g., length of employment and ending salary). In a defined contribution plan, your benefits are variable. Depending on when and how much you contribute, as well as your asset allocation, you might have a large or small nest egg at retirement. This is not to say that a defined contribution plan is inherently better or worse, but the name is hiding its defining feature: that it’s not a pension plan, and its value is variable.

So let’s call the shots as they are. A “feature phone” is a dumbphone. A “defined contribution” plan is a variable benefit plan. 👌

Thoughts about Bike-sharing

May 23, 2017 • 7 min read

As I was coming out of an exam a couple of days ago, I saw something I’ve seen a couple times before.

bike_truck.jpg

The truck that redistributes bike-sharing bicycles.

I’ve seen this bicycle redistribution truck around campus a few times since Princeton started its partnership in 2016 with Zagster to launch bike-sharing on campus. The program has 10 or so bike docking locations on campus that support a fleet of 50 bikes.

It’s just a little bit odd and funny that managing the bike-sharing program (a program that is meant to promote less driving around campus and more biking) requires loading bicycles into the back of a truck and redistributing the bikes to different stations. And according to the minutes from the March 27, 2016, meeting of Princeton’s student government, this truck goes around campus twice (twice!) a week to redistribute bikes.

Then, I was interested in doing a little more research into the effectiveness of the bike-sharing program. While I’m not a user of the service, if I were, I would have one requirement: can I find a bike every time that I need one? That is, can I be confident that when I go to a particular station to get a bike, there will be one there?

For different users of the service, I think the minimum success rate is quite different. For a tourist visiting Princeton’s campus, if you want to rent a bike, but there are no available bikes to rent, then no big deal. You can walk around campus instead.

But if you’re a student who relies on the service to get to lectures and classes, then if there’s no bike available that is actually a huge issue. Presumably, the whole point of using the bike-sharing service is so that you can allot less travel time. So if you try to rent a bike 5 minutes before lecture but don’t find one, then you will be late for lecture. If this happens a couple times, then you’ll probably just lose trust in the service, and get yourself a personal bike.

And that’s my qualm about the service: if one of its goals is to reduce the need to have a personal bike on campus for students, then I’m not sure if it can do that job. Similar to ride-hailing services, a bike-sharing service needs a large amount of inventory and liquidity to be thought of as a viable option. Though the more I think about bike-sharing, the more I am convinced that it is not for replacing a bicycle for a daily user.

All of that being said, here’s an interesting exercise: what is the success rate required for someone to depend on a bike-sharing service instead of a personal bike? I think the answer to this question is almost identical to how one might answer the same question about using Uber vs. using a personal vehicle. For me, I wouldn’t be surprised if a 99% success rate is the threshold to clear. That would mean that you fail to get a bike only 1 out of every 100 times you go to pick one up. At this level of reliability, you begin to approach the reliability of using a personal bicycle (I reckon that 99% reliability for a personal bike is a fair estimate).

But to get to 99% reliability, I wouldn’t be surprised if the Princeton bike-sharing program would need two or three times as many bikes as it does today. Unlike Uber, which can use surge pricing as an on-demand adjustment to bring more drivers onto its network during busy times, bicycles cannot be “dynamically” added into the network as demand changes. Of the docking locations I see around campus, it’s not uncommon for them to have no bikes at particularly busy parts of the day.

Another factor that is likely preventing a large influx of more bikes: the cost. According to an exploratory report on bike-sharing by the Los Angeles County Metro, the per bike installation cost for bike-sharing is $3000 to $5000! According to the press release, Princeton has 50 bikes deployed, which means an estimated cost of $150,000 to $250,000.

I guess everything above really brings me to my original point of writing this blog post, which is that this morning (5/22/17) after eating breakfast, I set out to visit 9 different stations and document how many bikes were at each location. Zagster clearly has information about inventory and usage that is orders of magnitude better than what this informal survey provides. And perhaps the Zagster app also has information about how many bikes each station has (but a cursory look at their app’s landing page suggests that it does not tell you which stations have available bikes). But I still just wanted to get out and see the stations with my own eyes.

Going into this, I had a suspicion that a few of the nine stations would have zero bikes. I admit, my bias is towards being skeptical of the effectiveness of the service. But to cut the suspense, of the stations I visited, only one of them (Firestone Library) had no bikes. However, to be fair, it was drizzling slightly this morning, so I suspect that the usage rate of the bike-share was lower than normal. I think there are a couple more stations farther out of the main campus that I didn’t visit, but I was able to see basically all of the stations that are located on the main campus of Princeton.

| Location | Number of Bikes |
| --- | --- |
| Lakeside Apartments | 9 |
| Lawrence Apartments | 9 |
| Computer Science Building | 4 |
| Carl Icahn Laboratory | 4 |
| Princeton Station | 3 |
| Richardson Auditorium | 2 |
| Forbes College | 2 |
| Frist Campus Center | 1 |
| Firestone Library | 0 |

Here are pictures of all of them (in the order that I visited) and some commentary.

Richardson Auditorium: 2 bikes (10:30 AM)

richardson_auditorium.jpg

I see this station quite frequently on my way to the dining hall, and it’s been empty many times before. But today, there are 2 bikes here.

Princeton Station: 3 bikes (10:45 AM)

princeton_station.jpg

This is a trend that I see fairly frequently: locking a non-bike-share bike to the bike-share location. I see why this happens as some places around campus don’t have convenient locking posts, and even of those locations that do have bicycle posts, the bike-share ones are often of higher quality.

Forbes College: 2 bikes (10:50 AM)

forbes_college.jpg

Lawrence Apartments: 9 bikes (11:00 AM)

lawrence_apartments.jpg

This was surprising. Lawrence Apartments house graduate students and are slightly off the main campus, but they do have quite a few bikes. I’m surprised that at 11:00 AM there were this many bikes still at the apartments. I would have guessed people would ride them into central campus in the morning.

Lakeside Apartments: 9 bikes (11:15 AM)

lakeside_apartments.jpg

Also graduate student housing. Also has a lot of bikes.

Carl Icahn Laboratory: 4 bikes (11:15 AM)

icahn_laboratory.jpg

This is where the picture of the truck redistributing bikes is from.

Frist Campus Center: 1 bike (11:45 AM)

frist_campus_center.jpg

This station is frequently empty.

Computer Science Building: 4 bikes (11:55 AM)

cs_building.jpg

Firestone Library: 0 bikes (12:00 PM)

firestone_library.jpg

Of course, it was the last station I visited that day that had zero bikes.

Two Safari Quibbles

February 3, 2017 • 3 min read

As a student, the flexibility of a PC is indispensable. And judging by the technology used by my peers, this is true of nearly every student. However, it is still true that the majority of my computing (maybe ~60%) happens in the web browser.

On macOS, Safari and Google Chrome are the two powerhouse web browsers. Both have support for modern web standards, and both are very extensible. But Safari has two clear advantages: two-finger scrolling responsiveness and power efficiency. The best comparison I can make between two-finger scrolling in the two web browsers is scrolling in Android and iOS. Chrome feels like scrolling in Android: not bad, but not good. Safari feels like scrolling in iOS: fantastic. Once you tinker around in iOS, you realize how janky Android scrolling is (I use a Moto X Android phone). And there might not be a more important feature than power efficiency. Because of how heavily I use the web browser it is often the largest consumer of battery life.

Having listed Safari’s advantages over Chrome, there are still two quibbles I have with Safari that keep me using Chrome. And both have to do with the way tabs are displayed in Safari. Note that I have Increase Contrast selected in Accessibility.

safari-tabs.png

Safari tabs.

chrome-tabs.png

Chrome tabs.

1. Lower Contrast Text

First, the text contrast of website titles in Safari is lower than in Chrome, which makes them harder to read. This quibble might be a function of using a non-retina MacBook Air, as I could see a sharper, more color accurate screen alleviating these issues. Even though Safari has lower contrast than Chrome, its text contrast is still above the 7:1 contrast ratio recommended by Apple.

| Web Browser | Contrast Ratio | Text Color | Background Color |
| --- | --- | --- | --- |
| Chrome (foreground tab) | 19.1:1 | rgb(0, 0, 0) | rgb(243, 243, 243) |
| Safari (foreground tab) | 14.4:1 | rgb(0, 0, 0) | rgb(214, 214, 214) |
| Chrome (background tab) | 14.3:1 | rgb(0, 0, 0) | rgb(213, 213, 213) |
| Safari (background tab) | 10.7:1 | rgb(0, 0, 0) | rgb(185, 185, 185) |
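
For anyone who wants to check numbers like these, here is a minimal sketch of a contrast ratio calculation. I haven't recorded which tool produced the ratios above, so this assumes the WCAG 2.x definition of contrast ratio (relative luminance of the lighter color plus 0.05, divided by that of the darker color plus 0.05). The function names are my own, and results may differ slightly from the table depending on rounding and how the colors were sampled.

```python
def _linearize(channel):
    """Convert one 8-bit sRGB channel (0-255) to linear light, per WCAG 2.x."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (r, g, b) tuple of 8-bit values."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(text_rgb, background_rgb):
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    lum_a = relative_luminance(text_rgb)
    lum_b = relative_luminance(background_rgb)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Black text on the Safari background-tab gray from the table.
print(round(contrast_ratio((0, 0, 0), (185, 185, 185)), 1))
```

Black on pure white is the maximum possible ratio of 21:1, which is a handy sanity check for any implementation.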

2. No Favicons

The decreased legibility of Safari tab labels wouldn’t be such a large issue if not exacerbated by my second quibble: no favicons next to website titles.

Here’s my reasoning for why Safari does not show favicons. Safari tabs are implemented using native macOS tabs that can be found in TextEdit, Finder, etc. And in every other application, tabs are labeled only with text.

Even so, for me, favicons are the single most important identifier for different tabs. With favicons, I can glance at an icon instead of reading text to figure out which tab is which. Even better, as you navigate to different pages of a website, often the title will change, but the favicon does not. So the favicon offers a certain degree of reliability that text labels do not.

To my point, where appropriate, Apple features icons on many other labels around macOS.

mac-icons.png

Icons used in System Preferences, Finder, and the “Command-Tab” Application Switcher.

Even Safari uses the Touch Bar to display favicons, not text labels.

macbookpro-touch-bar-safari-favorites.png

Image from Apple.

Fingers crossed—🤞🤞—here’s to favicons and increased text contrast in Safari tabs.

What Makes a Good Reddit Post?

February 2, 2017 • 22 min read

Note: This analysis of Reddit was created for the final project of ELE/COS 381 taught in Fall 2016. For this project, Alan Chen, Luis Gonzalez, and I decided to apply the topics learned in class to an analysis of Reddit to understand what makes a “good” Reddit post.

The link to the PDF of the report is here.


ELE/COS 381 Final Report: What makes a good Reddit Post?

By Alan Chen, Eric Chen, and Luis Gonzalez-Yante

1 Introduction

In ELE/COS 381, we have studied various networks both physical and digital. In particular, Chapter 8 of Networked Life focused on the study of topology and functionalities of social networks like Facebook and Twitter. In this project, we are interested in the study of Reddit and what specifically makes a good Reddit post.

reddit-frontpage.png

The frontpage of reddit.com.

Reddit is a bulletin board of user-generated content. As opposed to strictly ordering results by date, Reddit’s user interface strongly focuses on showing users content that is popular with other users. The site does this through a voting mechanism, where each user can upvote and downvote specific posts to the site. For individual users participating on the site, there is a motivation to create posts that resonate with other users, and thus generate a large number of upvotes.

Another key aspect of Reddit is its emphasis on sub-communities, which are each their own fiefdom on Reddit. According to [redditmetrics.com](http://redditmetrics.com/history), on January 14, 2017, there were 1,005,275 unique subreddits. Communities are organized around topic or interest. Some communities such as /r/pics and /r/news are very broad and general interest communities. Other communities appeal to much smaller groups, such as /r/vexillology, which is dedicated to discussion and commentary about flags.

It is important to note that a large component of the Reddit community does involve commenting. Comments, like posts, can be upvoted and downvoted, and comments are sorted by popularity. However, in this project, we decided to focus only on analyzing posts—which subreddits they are in and how many upvotes they receive—to simplify analysis. Further and more extensive work would likely include the study of comments, as well as analysis of the content of posts and comments.

2 Goals

We decided to look at three different methods of quantifying what makes a good Reddit post. Reddit contains many emergent phenomena not explicitly designed for in the site and not completely obvious to new users. The site can be difficult to approach from the outside, but from our experience with the site (one shared by its many frequent users), we believe Reddit can be a funny, insightful, and engaging online community. Thus, our ultimate goal for this analysis was to provide a set of conclusions and actions that might be given to a new user of the site in a “How to Use Reddit” guide.

For questions addressed in sections x.1 and x.3 that follow, we are interested in characteristics of different types of subreddits. We decided on three types: top 100, small (< 10,000 subscribers), and original content.

| Top 100 | 25 Small (<10,000 subscribers) | 10 Original Content |
| --- | --- | --- |
| AskReddit | OCPoetry | lexington |
| pics | improv | ReadmyStory |

Three categories of subreddits analyzed in x.1 and x.3

Top 100 subreddits are the biggest 100 subreddits based on the number of subscribers. The largest subreddit on the site is /r/AskReddit with over 15 million subscribers, and the 100th biggest subreddit has just under 500,000 subscribers. In contrast, we sampled 25 small subreddits that were randomly selected among subreddits with fewer than 10,000 subscribers. Finally, we hand-selected 10 subreddits that contained significant amounts of original content, where we defined original content as a post in which the user creates the material rather than aggregating external content. It is important to note that the original content subreddits had subscriber counts more similar to the small category, as well.

A particular user will often follow general-interest subreddits, but may also follow one for her municipality or for a niche RPG that she plays. So these three categories of subreddits provide a broad perspective of both general-interest and niche communities that we believe accurately reflects the usage patterns of users on the site.

2.1 Quadrant Analysis: How much engagement do top posts get?

Given the focus of Reddit on distinct communities, and with over 1 million unique subreddit communities on the site, we expect that there will be large variations in communities. There might be countless ways to quantify the differences, but we decided on examining subreddits along two dimensions: individual engagement and community engagement.

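Individual engagement is defined as follows for any subreddit:

$$\text{individual engagement}(s) = \frac{1}{|U_s|} \sum_{u \in U_s} \frac{\text{number of posts by } u \text{ in } s}{\text{total number of posts by } u}$$

where $U_s$ is the set of users who authored the top posts in subreddit $s$.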

For example, we might look at the top posts of /r/politics and generate an individual engagement score of 0.7. We calculate this number by looking at the users who author the top posts in a subreddit. For each user, we examine their post history, seeing how many posts are in /r/politics and how many posts are in other subreddits. We then calculate the proportion of posts that are in /r/politics for each user, and then average across all of the users who author the top posts in /r/politics to generate the individual engagement for the subreddit. A value of 1 for a given subreddit means that users are very engaged in that subreddit: they only ever post there. Whereas a value of 0.05 means that the top users in that subreddit only post to that subreddit 5% of the time.
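
The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than our original script: the function name and the `post_histories` structure (a dict mapping each top user to the list of subreddit names they have posted in, one entry per post) are assumptions made for the example.

```python
def individual_engagement(subreddit, post_histories):
    """Average, over top users, of the share of each user's posts made in `subreddit`.

    post_histories: dict of username -> list of subreddit names, one per post
    (a hypothetical structure chosen for this sketch).
    """
    target = subreddit.lower()  # subreddit names are compared case-insensitively
    proportions = []
    for history in post_histories.values():
        if not history:
            continue  # skip users with no visible post history
        posts_in_subreddit = sum(1 for name in history if name.lower() == target)
        proportions.append(posts_in_subreddit / len(history))
    if not proportions:
        return 0.0
    return sum(proportions) / len(proportions)


# Made-up example: one user posts only in /r/politics, the other 25% of the time.
histories = {
    "user_a": ["politics", "politics"],
    "user_b": ["politics", "news", "pics", "funny"],
}
print(individual_engagement("politics", histories))  # (1.0 + 0.25) / 2
```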

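Community engagement is defined as follows for any subreddit:

$$\text{community engagement}(s) = \frac{1}{|P_s|} \sum_{p \in P_s} \frac{\text{upvotes}(p)}{\text{subscribers}(s)}$$

where $P_s$ is the set of top posts in subreddit $s$.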

The intuition for the metric is as follows: if the top posts in subreddits A and B both receive 1,000 upvotes on average, but if A has 1 million subscribers while B has only 100,000 subscribers, then subreddit B has 10x the community engagement of subreddit A. Similar to individual engagement, community engagement is a metric that is calculated for any particular subreddit. For the top posts in the subreddit, we normalized the upvotes with the number of subscribers to that subreddit. Then, we averaged over all the top posts in the subreddit.
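
As a rough illustration of this metric, the sketch below normalizes each top post's upvotes by the subscriber count and averages the result. The function name and the input format are assumptions for the example, and the numbers are the hypothetical subreddits A and B from the paragraph above.

```python
def community_engagement(top_post_upvotes, subscribers):
    """Mean of (upvotes / subscribers) over a subreddit's top posts."""
    if subscribers <= 0:
        raise ValueError("subscribers must be positive")
    if not top_post_upvotes:
        raise ValueError("need at least one top post")
    normalized = [upvotes / subscribers for upvotes in top_post_upvotes]
    return sum(normalized) / len(normalized)


# Subreddits A and B: same average upvotes, different subscriber counts.
engagement_a = community_engagement([900, 1000, 1100], subscribers=1_000_000)
engagement_b = community_engagement([900, 1000, 1100], subscribers=100_000)
print(engagement_b / engagement_a)  # B has roughly 10x the engagement of A
```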

Our intention for creating individual and community engagement was based on the strong suspicion that some subreddits, perhaps those of a more niche topic, attract higher individual engagement, while general interest subreddits might generally have lower individual engagement. Additionally, because community engagement corresponds to what proportion of the community a top post needs to appeal to, we thought it reasonable that it’s easier to create a top post in a subreddit with low community engagement because you have to appeal to a smaller fraction of the subscriber base.

Reddit users are, like most people, multidimensional and likely to frequent multiple subreddits. As they move between communities, they bring with them the context of the different communities they participate in. This context could take the form of knowledge of memes or inside jokes.

Because subreddits are composed solely through contributions to the subreddit, we implemented a metric that takes the top posts at the current time of the subreddit, and analyzes all the posts of the authors of those top posts (whom we call top users) to see how often they post in the original subreddit a, and how often they post in the target subreddit b.

By taking the concentration of posts in the first subreddit, taken as a representation of the degree to which the author is a member of the first subreddit, and multiplying by the concentration of posts in the second subreddit, taken as a representation of the degree to which the author is a member of the second subreddit, we obtain a metric describing the crossover between subreddits for a given author. This participation function is maximized when a user participates 50% in subreddit a and 50% in subreddit b, and it contains an extra factor of four to scale possible values to between 0 and 1. The metric for the subreddit is the average of the values for its top users.
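
A minimal sketch of this participation function is shown below, assuming each top user's history is a list of subreddit names (one entry per post). The function names and the data layout are illustrative assumptions, not our original code.

```python
def participation(history, sub_a, sub_b):
    """4 * (share of posts in sub_a) * (share of posts in sub_b) for one user."""
    if not history:
        return 0.0
    share_a = history.count(sub_a) / len(history)
    share_b = history.count(sub_b) / len(history)
    # The factor of 4 rescales the maximum (0.5 * 0.5 = 0.25) to 1.
    return 4 * share_a * share_b


def cross_pollination(post_histories, sub_a, sub_b):
    """Average participation between sub_a and sub_b over sub_a's top users."""
    scores = [participation(h, sub_a, sub_b) for h in post_histories.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up histories: an even split scores 1, a user who never leaves scores 0.
print(participation(["sixers", "NBA"], "sixers", "NBA"))
print(participation(["sixers", "sixers"], "sixers", "NBA"))
```

A user who posts exclusively in one subreddit always contributes 0, which is why communities whose top users never post elsewhere end up with no links at all.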

Modeling subreddits as nodes in a network is an extension of Chapter 8: How do I influence people on Facebook and Twitter? An important notion from that chapter is that some links between users are more important than others, such as links that were included in many shortest paths. Here, we extend that intuition to our participation metric, which generates a network of weighted links, and we use the weights to directly infer which links are important in the network representation.

2.3 The Reddit Power Index: How “average” are top users?

Because of the lack of real names on Reddit, it is not immediately clear who are the “top users” and who are not. For example, on Twitter, the users with the most followers are often celebrities and public figures. But who are these top users of Reddit, and how dominant are they? Specifically, is it possible for the average user of Reddit to have a post become very popular in a subreddit? Or are the frontpages of subreddits controlled by an elite group of users?

To answer these questions, we were interested in quantifying how good the top users are. To do so, we created a metric called the Reddit Power Index (RPI). Fundamentally, RPI is a metric that can be calculated on any post, and is defined as follows:

$$\text{RPI}_{\text{post}} = \frac{\text{upvotes on the post}}{\text{average upvotes for a post in that subreddit}}$$

Thus, the RPI for any post is the number of upvotes it has divided by the average number of upvotes for a post in that subreddit. An RPI below 1 means that the post is below average, and any RPI above 1 means that the post is above average.

rpi-gradient.png

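Because we are interested in categorizing subreddits as a whole, we extend the definition of RPI first to users and then to subreddits as follows:

$$\text{RPI}_{\text{user}} = \frac{1}{|P_u|} \sum_{p \in P_u} \text{RPI}_{\text{post}}(p) \qquad \text{RPI}_{\text{subreddit}} = \frac{1}{|U_s|} \sum_{u \in U_s} \text{RPI}_{\text{user}}(u)$$

where $P_u$ is the set of posts by user $u$ and $U_s$ is the set of top users of subreddit $s$.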

Naturally, the RPI for a user is the average RPI of its posts, and the RPI for a subreddit is the average RPI of its top users.
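
The three levels of RPI can be sketched as follows. This is a simplified illustration with assumed function names and made-up numbers; it uses a plain average at every level, whereas our actual subreddit-level calculation used a trimmed mean (see section 4.3).

```python
from statistics import mean


def post_rpi(upvotes, subreddit_average_upvotes):
    """RPI of one post: its upvotes relative to the subreddit's average post."""
    return upvotes / subreddit_average_upvotes


def user_rpi(posts, average_upvotes):
    """Average post RPI for one user.

    posts: list of (subreddit, upvotes) pairs (hypothetical structure).
    average_upvotes: dict of subreddit -> average upvotes per post.
    """
    return mean(post_rpi(upvotes, average_upvotes[sub]) for sub, upvotes in posts)


def subreddit_rpi(top_user_posts, average_upvotes):
    """Average user RPI over a subreddit's top users."""
    return mean(user_rpi(posts, average_upvotes) for posts in top_user_posts.values())


# Made-up data: one above-average user and one below-average user.
averages = {"pics": 100, "news": 50}
top_users = {
    "user_a": [("pics", 200), ("news", 50)],  # post RPIs 2.0 and 1.0
    "user_b": [("pics", 50)],                 # post RPI 0.5
}
print(subreddit_rpi(top_users, averages))
```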

With RPI, we created a heuristic which indicates the difficulty of creating a top post in any given subreddit. A subreddit with high RPI is one dominated largely by users who consistently have successful posts. Whereas a subreddit with lower RPI is in some sense more “democratic” because the community surfaces posts from users who are closer to the average Reddit user in terms of past post upvote performance.

3 Implementation

We collected our data using Python scripts and the Python package PRAW, the Python Reddit API Wrapper. The wrapper authenticates using OAuth, creating a Reddit instance that contains the client_id, client_secret, password, and username, which the rest of the PRAW API acts upon. While Reddit has its own public API that exposes data through JSON, we preferred using PRAW as an intermediary because it handled authentication, rate-limiting, and exceptions, and allowed us to focus on writing the data collection scripts.

implementation-diagram.png

Diagram of our data collection process.

The Reddit instance generated by PRAW can access any subreddit, post, or user that is accessible through reddit.com. Additionally, PRAW generates iterables which are very useful for data collection. For example, we can create a PRAW object that is a list of all of the current top posts in a subreddit, and this object can be iterated through like any other list in Python. We then can get key data for each submission, such as author, upvote count, and title. Similarly, each Reddit user can generate a PRAW object which is a list of its recent posts. This technique of (a) iterating through the top posts in a subreddit, (b) finding the upvotes and author of each top post, and (c) finding the history of posts for that user forms the foundation of our data collection techniques.
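
The three steps (a) to (c) might look roughly like the sketch below. It is not our original script: the function names, limits, and credential placeholders are illustrative, and it uses `top()` where a listing such as `hot()` could equally be meant by "current top posts". The PRAW calls used (`reddit.subreddit(...)`, `submission.author`, `submission.score`, `redditor.submissions.new(...)`) are from PRAW 4 and later.

```python
def make_reddit():
    """Build an authenticated PRAW client. All credential strings are placeholders."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        username="YOUR_USERNAME",
        password="YOUR_PASSWORD",
        user_agent="subreddit-analysis sketch",
    )


def top_posts_with_history(reddit, subreddit_name, post_limit=25, history_limit=100):
    """Collect author, upvotes, title, and author post history for top posts."""
    rows = []
    # (a) iterate through the top posts in the subreddit
    for submission in reddit.subreddit(subreddit_name).top(limit=post_limit):
        # (b) find the upvotes and author of each top post
        author = submission.author
        if author is None:  # the account was deleted
            continue
        # (c) find the history of posts for that user
        history = [
            (post.subreddit.display_name, post.score)
            for post in author.submissions.new(limit=history_limit)
        ]
        rows.append(
            {
                "author": author.name,
                "upvotes": submission.score,
                "title": submission.title,
                "history": history,
            }
        )
    return rows
```

Usage would be along the lines of `rows = top_posts_with_history(make_reddit(), "AskReddit")`; each inner history lookup is a separate API request, which is where most of the running time goes.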

While PRAW simplified using the Reddit API, we still encountered significant roadblocks with regard to rate-limiting. Each request to the API can return a maximum of 100 posts at a time, and PRAW delays 2 seconds between API requests. In all of our scripts, there was a trade-off where collecting more data resulted in longer execution times. We were able to strike a reasonable balance by often choosing to look at samples of top posts of size 25 to 100.

However, when collecting data for the RPI, these rate-limits actually became a bottleneck factor (script execution time took hours). Our issue was that calculating the RPI for each user often requires collecting average upvote data for dozens of subreddits, each requiring a scraping of random posts that took several seconds. We largely overcame the issue by caching the average upvote value for subreddits and storing this data in a CSV file so it was persistent across different executions. Then, for example, if we had already scraped /r/AskReddit before, our code gets the average upvotes from the local cache instead of hitting the API, which makes a multi-second operation essentially free.
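
The caching idea can be sketched as below. The file name, the two-column CSV layout, and the `fetch` callback (standing in for the slow scraping step) are assumptions for this example rather than a description of our exact code.

```python
import csv
import os


def load_cache(path):
    """Read {subreddit: average_upvotes} from a two-column CSV, if it exists."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row[0]: float(row[1]) for row in csv.reader(f) if len(row) == 2}


def save_cache(path, cache):
    """Write the whole cache back to disk so it persists across executions."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(sorted(cache.items()))


def average_upvotes(subreddit, fetch, path="avg_upvotes.csv"):
    """Return the cached average, calling fetch(subreddit) only on a cache miss."""
    cache = load_cache(path)
    if subreddit not in cache:
        cache[subreddit] = float(fetch(subreddit))  # slow, rate-limited step
        save_cache(path, cache)
    return cache[subreddit]
```

Re-reading the CSV on every call keeps the sketch short; a longer-running script would more likely load the cache once and keep it in memory.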

Also, we should note that many subreddits closely follow power-law distributions for the number of upvotes on a random post, which had a noticeable impact on our sampling results.

power-law.png

Log-log plot of upvotes for 1,000 random submissions to /r/AskReddit, a Top 100 subreddit.

Thus, because rate-limits constrained the number of posts we could sample, there was the potential for statistics to be moved wildly by sampling one outlier post out of 100. We believe that the power-law distributions for upvotes on Reddit posts arise from how Reddit surfaces content to users. By default, users of the site are shown posts that are already popular, and then as a function of their increased visibility, these popular posts will continue to receive more upvotes. This is analogous to the model of preferential attachment studied in Chapter 10 of Networked Life, “Does the Internet Have an Achilles’ Heel?” In preferential attachment, new nodes added to a network tend to connect to nodes which already have high in-degree, similar to how new upvotes tend to accumulate on posts that already have a lot of upvotes. Thus, when analyzing the RPI for subreddits in 4.3, because RPI for a subreddit depends on the average upvotes from various other subreddits, we believe the median is a more explanatory measure of typical RPI.

4 Discussion and Conclusions

4.1 Quadrant Analysis: How much engagement do top posts get?

quadrant-analysis.png

Plot of individual and community engagement for various subreddits.

The category of Top 100 subreddits is clustered in the lower-left quadrant, meaning the top posts are authored by users with low individual engagement and those posts receive low community engagement. Given that these subreddits are general-interest, this pattern is not surprising. For instance, while many people might be interested in the funny pictures from /r/Funny, very few people will only be interested in funny pictures.

For Original Content subreddits, it is difficult to discern a significant difference from the other categories in terms of individual engagement, but it is clear that community engagement is relatively low on the whole. Perhaps this is a factor of subreddits driven by Original Content not needing to appeal to a very wide fraction of the subreddit.

For small subreddits, we do see relatively large individual engagement and community engagement compared to the Top 100 subreddits. We propose the following explanatory mechanism: smaller subreddits are about more niche topics, so while many people might subscribe to /r/Funny because they like funny pictures, the only users that will subscribe to /r/lexington are people who live in Lexington, KY. As a result, this self-selection means that the typical subscriber to /r/lexington is more invested in the subreddit than the typical subscriber to /r/funny. As the plot shows, these self-selecting groups tend to have top posts from users with higher individual engagement (frequently above 0.5), and they can have very high community engagement.

One interesting datapoint separate from the three categories that we added was /r/The_Donald. This subreddit is for supporters of Donald Trump and has created quite a bit of controversy on Reddit, with Reddit coming under fire for possible censoring of the subreddit, and members of the subreddit being accused of being toxic and harmful to the overall community. The controversial status of /r/The_Donald carries over to the plot, as the subreddit is a datapoint with no peers. It has extremely high individual engagement of 0.81, which means that the users who have the top posts in /r/The_Donald contribute 81% of their total posts to that community. The community engagement is the second highest of the subreddits we investigated, coming second to a much smaller community dedicated to Star Wars prequel memes (335,783 subscribers to /r/The_Donald, 5,910 to /r/PrequelMemes).

For users new to Reddit looking to create posts that rise to the top, we suggest looking at subreddits in the lower-left quadrant, with low individual and community engagement. We suggest looking for low individual engagement subreddits because these subreddits likely surface top posts that don’t require extensive history of context and knowledge about the particular inside jokes and idiosyncrasies of that subreddit. We also suggest subreddits with low community engagement because these are communities where posts can become popular while appealing to a smaller proportion of the community. An example of a subreddit that matches these criteria is /r/PersonalFinance, with scores of 0.06 and 2.6e-4 for individual and community engagement respectively.

First we had to decide which subreddits to measure against each other. We first observed top subreddits, such as /r/funny and /r/pics, but what we found was that in all tested cases, the connection value was either negligibly low or 0. This is probably because the posters to those subreddits are so varied and their interests so varied that the size of samples we were taking did not discover cross-pollination.

The second set of tests we made was across political subreddits, including /r/The_Donald and /r/hillaryclinton, trying to match them with subreddits with similar views, such as /r/dncleaks and /r/enoughtrumpspam. However, because the top posters in political subreddits seem to stay very heavily within those subreddits, they received 0 scores even with subreddits with similar views.

We then decided to test a network known to be based on geography, local sports, to see if we could gain information about the sports tendency of areas and locations. We examined the “big four” of Philadelphia (basketball, football, baseball, and hockey teams) and their relatedness to each other, as well as some of the relevant league subreddits (e.g. /r/NBA).

philadelphia.png

Cross-pollination graph for Philadelphia sports teams.

Here in the graphs, red lines indicate a participation from one subreddit in another of 0.1 or greater, green lines indicate a participation of 0.01 or greater, and blue lines indicate any other non-zero participation. What we found was interesting. Comparing to known information, we can confirm that /r/timberwolves and /r/sixers users are a subset of /r/NBA. We can also confirm that Philadelphia is a football town, as is known, and also make the interesting observation that /r/NBA is the biggest sports subreddit, even though local people are most likely to follow the football team in addition to any other sports of the area.
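
The color thresholds used in the graphs amount to a simple bucketing rule, sketched here with an assumed function name:

```python
def edge_color(participation):
    """Map a participation score to the line color used in the graphs."""
    if participation >= 0.1:
        return "red"
    if participation >= 0.01:
        return "green"
    if participation > 0:
        return "blue"
    return None  # zero participation: no line is drawn


print([edge_color(score) for score in (0.25, 0.05, 0.001, 0.0)])
```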

We also applied the same analysis to Minnesota sports teams.

minnesota.png

Cross-pollination graph for Minnesota sports teams.

The above graph suggests that if you want to become a star in /r/Timberwolves, then you should also participate in /r/NBA and /r/MinnesotaTwins. There is likely context in terms of discussion, memes, and knowledge among these communities that is shared among the top users.

Overall we can conclude that subreddits with users that stay only in the subreddit have 0 cross pollination, and that it is clear that larger communities have more participation in and less participation out.

Future improvements that could be made include scraping all posts of a given subreddit, though it would take much more time, and also nuancing the conclusions reached here by interlacing these results with the previous analysis of engagement of the top posts.

4.3 The Reddit Power Index: How “average” are top users?

RPI is a metric we created that can help us pinpoint the difficulty of creating a top post in any given subreddit. The RPI also allows us to clearly see what subreddits are dominated by users that consistently have successful posts. All this really means is that with the RPI we are able to find subreddits that will have more or less “alpha redditors” that one will have to compete with for upvotes.

Category               Average RPI    Median RPI
Top 100                149.8          13.8
25 Small (<10,000)     3.0            2.0
10 Original Content    15.2           2.6

RPI scores for the three different categories of subreddits examined.

On the whole, RPI did provide a reasonable number we could use in analysis. We used a 10% trimmed mean when calculating RPI_subreddit to protect the metric against large outliers. However, there are still cases such as /r/gaming in the Top 100, which had an outlier RPI of 4767 that skewed the average RPI of Top 100 subreddits upwards. This is why we believe that median RPIs are the better way to quantify the typical subreddit RPI in our categories.
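To illustrate why a trimmed mean and a median resist outliers better than a plain average, here is a minimal sketch. The numbers are made up for illustration and are not from our data set, and the function assumes that "10% trimmed" means dropping 10% of the values from each end, which is one common convention.

```python
from statistics import mean, median


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the values from each end.

    Assumption: the cut is applied to both tails, and the number of values
    removed from each tail is rounded down.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values removed from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


# Hypothetical scores with one extreme outlier.
scores = [1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 3.5, 4.0, 400.0]

plain_average = mean(scores)         # pulled far upwards by the outlier
trimmed_average = trimmed_mean(scores)  # the outlier and the minimum are dropped
middle_value = median(scores)        # unaffected by the size of the outlier
```

With a single large outlier, the plain average lands far above every typical value, while the trimmed mean and the median both stay within the range of the ordinary scores.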

Shifting our focus to Figure 8, we can see that the median RPIs are more reliable. Subreddits that fall into the Top 100 category have significantly higher RPIs than small or original content subreddits. This tells us that the popular posts in the Top 100 subreddits are typically created by users who get more than 10x the average number of upvotes! Conversely, our data show that the RPI for top users in subreddits with fewer than 10,000 subscribers has a median value of 2, which is much closer to that of an average Reddit user. The Original Content subreddits also have low RPIs, though we suspect that this might be largely a function of subscriber count, as the Original Content subreddits we examined happened to have smaller subscriber bases.

New users of Reddit should target subreddits with RPI scores closer to 1. These subreddits commonly surface content from more average users, not just from Reddit superstars. Our data tell us that smaller subreddits, such as /r/improv with an RPI of 1.4 and 7,332 subscribers, typically have lower RPI scores. Even some larger subreddits, such as /r/Frugal with an RPI of 1.7 and 613,046 subscribers and /r/GetMotivated with an RPI of 1.6 and 9,779,376 subscribers, have low RPI scores as well.

Posting to a subreddit with low RPI does not guarantee success. What it does mean is that you are competing against more average users of Reddit to get the top posts, hopefully giving yourself a chance to stand out.

Final Words

“Man naturally desires, not only to be loved, but to be lovely.” Adam Smith wrote that line in 1759 in The Theory of Moral Sentiments, and the same could be said of our behavior on Reddit today. We, the users of this social bulletin board called Reddit, crave recognition and popularity. We want to be “loved” and “lovely”: we want our posts to become popular.

To that end, our analysis produced a few steps of action. There are clear differences between large and small subreddits in terms of individual and community engagement, as well as the RPI of top users. These differences are important to keep in mind for new users coming to the site, as it might be beneficial to begin in subreddits tailored to your interests while also keeping an eye out for low RPI, low engagement subreddits. Additionally, understanding the importance of what we termed cross-pollination will help you tailor which sets of communities to participate in. By deliberately choosing a set of subreddits, you can benefit from the same shared context that top users in those subreddits already have. Hopefully, with these takeaways, the factors of what makes a good Reddit post have been made clearer, and we can all use them to be more “loved” and more “lovely” on the site.

Mini-Buses and uberPool

January 14, 2017 • 2 min read

Last month, I started reading A Pattern Language: Towns, Buildings, Construction. It’s a book about architecture that contains 253 rules for building everything from metropolitan areas (2. The Distribution of Towns) to houses (221. Natural Doors and Windows).

There’s so much to talk about from this book, but one pattern in particular caught my attention: 20. Minibuses. Here’s a quote from the passage:

Buses and trains, which run along lines, are too far from most origins and destinations to be useful. Taxis, which can go from point to point, are too expensive.

To solve the problem, it is necessary to have a kind of vehicle which is half way between the two—half like a bus, half like a taxi—a small bus which can pick up people at any point and take them to any other point, but which may also pick up other passengers on the way, to make the trip less costly than a taxi fare.

The system hinges, to a certain extent, on the development of sophisticated new computer programs. As calls come in, the computer examines the present movements of all the various mini-buses, each with its particular load of passengers, and decides which bus can best afford to pick up the new passenger, with the least detour.

Replace “mini-buses” with “uber cars” and that quote reads like a convincing pitch for uberPool.
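The dispatch rule in the quote (send the vehicle that can pick up the new passenger with the least detour) can be sketched in a few lines. Everything here is my own simplification, not something from the book or from Uber: vehicles live on a grid, distance is measured in city blocks, each vehicle has only one next stop, and the names Bus, pickup_detour, and choose_bus are made up.

```python
from dataclasses import dataclass

Point = tuple[float, float]


def block_distance(a: Point, b: Point) -> float:
    """City-block (Manhattan) distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


@dataclass
class Bus:
    name: str
    position: Point   # where the bus is right now
    next_stop: Point  # where it is already headed


def pickup_detour(bus: Bus, pickup: Point) -> float:
    """Extra distance the bus travels if it collects the passenger first."""
    direct = block_distance(bus.position, bus.next_stop)
    via_pickup = (block_distance(bus.position, pickup)
                  + block_distance(pickup, bus.next_stop))
    return via_pickup - direct


def choose_bus(buses: list[Bus], pickup: Point) -> Bus:
    """Pick the bus whose route grows the least by adding the pickup."""
    return min(buses, key=lambda bus: pickup_detour(bus, pickup))


# Hypothetical fleet: bus A passes right by the caller, bus B does not.
fleet = [
    Bus("A", position=(0, 0), next_stop=(10, 0)),
    Bus("B", position=(5, 5), next_stop=(5, 9)),
]
best = choose_bus(fleet, pickup=(4, 0))
```

A real system would also have to account for drop-offs, passengers already on board, and traffic, which is presumably where the "sophisticated new computer programs" come in.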

I don’t think anyone could have reasonably predicted, when the book was published in 1977, that mini-buses themselves would never become widespread but that the idea only needed a few decades: once the internet took hold and Moore’s Law made it possible to build an iPhone, the concept behind the mini-bus would become feasible.

I think there are two takeaways here:

  1. A Pattern Language: Towns, Buildings, Construction stands the test of time. Mini-buses didn’t take hold, but the idea was clearly pointed in the right direction. There are a lot of “I never thought about it that way” moments in this book.

  2. Human ingenuity is a very strong force. I’m generally not overly optimistic about any one specific technology (e.g. A.I., genetic sequencing, renewable energy). But on the whole, I am optimistic that things will be better in a decade than they are today, and I do mean that in the broadest sense. Who knew 40 years ago that mini-buses would become uberPool? I don’t know what’s coming tomorrow. Whatever it is, though, it will probably be better than what we have today.